Posted to dev@calcite.apache.org by Julian Hyde <jh...@apache.org> on 2016/02/04 09:35:53 UTC

Re: About Stream SQL

I totally agree with you. (Sorry for the delayed response; this week has been very busy.)

There is a tendency of vendors (and projects) to think that their technology is unique, and superior to everyone else’s, and want to showcase it in their dialect of SQL. That is natural, and it’s OK, since it makes them strive to make their technology better.

However, they have to remember that the end users don’t want something unique, they want something that solves their problem. They would like something that is standards compliant so that it is easy to learn, easy to hire developers for, and — if the worst comes to the worst — easy to migrate to a compatible competing technology.

I know the developers at Storm and Flink (and Samza too) and they understand the importance of collaborating on a standard.

I have been trying to play a dual role: supplying the parser and planner for streaming SQL, and facilitating the creation of a standard language and semantics for streaming SQL. For the latter, see the Streaming page on Calcite’s web site[1]. On that page, I intend to illustrate all of the main patterns of streaming queries, give them names (e.g. “Tumbling windows”), and show how they translate into streaming SQL.
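To make the “tumbling window” pattern concrete, here is a minimal Python sketch of its semantics (the rows, column names, and window size are made up for illustration, not taken from the Streaming page): each row is assigned to exactly one fixed-width window based on its rowtime.

```python
from collections import defaultdict

def tumbling_totals(rows, size):
    """Sum `units` per fixed, non-overlapping window of `size` time
    units; each row falls into exactly one window (a true partition)."""
    totals = defaultdict(int)
    for rowtime, product, units in rows:
        window_start = (rowtime // size) * size   # floor to window boundary
        totals[window_start] += units
    return dict(totals)

orders = [(1, "paint", 10), (2, "paper", 5), (4, "brush", 3), (7, "paint", 2)]
print(tumbling_totals(orders, size=3))   # {0: 15, 3: 3, 6: 2}
```

Because the windows partition the rowtime axis, this is exactly the case where a plain GROUP BY over a floored timestamp suffices.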

Also, it would be useful to create a reference implementation of streaming SQL in Calcite so that you can validate and run queries. The performance, scalability and reliability will not be the same as if you ran Storm, Flink or Samza, but at least you can see what the semantics should be.

I believe that most, if not all, of the examples that the projects are coming up with can be translated into SQL. It will be challenging, because we want to preserve the semantics of SQL, allow streaming SQL to interoperate with traditional relations, and also retain the general look and feel of SQL. (For example, I fought quite hard[2] recently for the principle that GROUP BY defines a partition (in the set-theory sense)[3] and therefore could not be used to represent a tumbling window, until I remembered that GROUPING SETS already allows each input row to appear in more than one output sub-total.)
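The distinction matters because, unlike a GROUP BY partition, overlapping (“hopping”) windows assign each input row to several sub-totals, just as GROUPING SETS does. A small Python sketch (the slide and size values are hypothetical) of which hopping windows contain a given row:

```python
def hop_windows(rowtime, slide, size):
    """Start times of every hopping window of width `size`, advancing by
    `slide`, that contains `rowtime`.  With size > slide the windows
    overlap, so each row belongs to size/slide windows."""
    starts = []
    start = (rowtime // slide) * slide   # latest window starting at or before rowtime
    while start > rowtime - size:        # window [start, start + size) contains rowtime
        starts.append(start)
        start -= slide
    return sorted(starts)

# A row at time 7 falls into three overlapping 6-unit windows sliding by 2:
print(hop_windows(7, slide=2, size=6))   # [2, 4, 6], i.e. [2,8), [4,10), [6,12)
```

With size equal to slide the windows stop overlapping and this degenerates to the tumbling case, where each row lands in exactly one window.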

What can you, the users, do? Get involved in the discussion about what you want in the language. Encourage the projects to bring their proposed SQL features into this forum for discussion, and add to the list of patterns and examples on the Streaming page. As in any standards process, the users help to keep the vendors focused.

I’ll be talking about streaming SQL, planning, and standardization at the Samza meetup in 2 weeks[4], so if any of you are in the Bay Area, please stop by.

Julian

[1] http://calcite.apache.org/docs/stream.html

[2] http://mail-archives.apache.org/mod_mbox/calcite-dev/201506.mbox/%3CCAPSgeETbowxM2TRX0RFxQ_tEAPk2uM=hE0aryWinBtovGwbddQ@mail.gmail.com%3E

[3] https://en.wikipedia.org/wiki/Partition_of_a_set

[4] http://www.meetup.com/Bay-Area-Samza-Meetup/events/228430492/

> On Jan 29, 2016, at 10:29 PM, Wanglan (Lan) <la...@huawei.com> wrote:
> 
> Hi to all,
> 
> I am from Huawei and am focusing on data stream processing.
> Recently I noticed that both the Storm community and the Flink community are endeavoring to use Calcite as a SQL parser to enable Storm/Flink to support SQL. Both want to supplement or clarify Calcite's streaming SQL, especially the definition of windows.
> I am concerned that if both communities design Stream SQL syntax separately, two different syntaxes will emerge to represent the same use cases.
> Therefore, I am wondering if it is possible to unify this work, i.e. to design and complement Calcite's streaming SQL with richer window definitions so that both Storm and Flink can reuse Calcite (streaming SQL) as their SQL parser for streaming cases with little change.
> What do you think about this idea?
> 


Re: About Stream SQL

Posted by Julian Hyde <jh...@apache.org>.
Sorry, I misunderstood. As for ways to tell the system that it can make
progress, the more the merrier. There's no single "best" mechanism; it
depends on the business problem. A good engine should support several,
including fully-ordered columns, punctuation, and slack, and let users
choose on a per-stream (per-topic) or even per-query basis.
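As a rough illustration of one such mechanism, here is a Python sketch of slack-based progress (the function name, inputs, and numbers are illustrative, not a Calcite API): the engine assumes rows may arrive up to `slack` time units out of order, and after each row emits a conservative bound, a punctuation or low-watermark, below which no further rowtimes are expected.

```python
def watermarks_with_slack(rowtimes, slack):
    """After each arriving row, emit a conservative low-watermark:
    no row with rowtime <= (max rowtime seen - slack) is expected
    any more, so rows up to `slack` units out of order are tolerated."""
    high = float("-inf")
    marks = []
    for t in rowtimes:
        high = max(high, t)       # highest rowtime observed so far
        marks.append(high - slack)
    return marks

# Rows arrive slightly out of order; the watermark only ever advances:
print(watermarks_with_slack([1, 3, 2, 7, 5], slack=2))   # [-1, 1, 1, 5, 5]
```

A fully-ordered column is the degenerate case (slack of zero), while explicit punctuation replaces the heuristic bound with a statement from the producer.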

On Tue, Feb 23, 2016 at 9:03 AM, Milinda Pathirage
<mp...@umail.iu.edu> wrote:
> Hi Julian,
>
> I agree with you. Calcite should stay away from physical properties of
> stream as much as possible. I was just trying to clarify the confusion
> regarding the punctuations and watermarks. My last question was not related
> to Calcite, rather to Flink and other implementations. Sorry for the
> confusion.
>
> Milinda
>
> On Tue, Feb 23, 2016 at 11:41 AM, Julian Hyde <jh...@apache.org> wrote:
>
>> As the author of the streaming SQL specification, I don't care at all
>> how the system deduces that it is able to make progress. Just as the
>> authors of the SQL standard don't care whether a vendor chooses to
>> store records sorted and/or compressed.
>>
>> All the streaming SQL validator/optimizer needs to know is that, say,
>> orderTime is monotonic, orderId is quasi-monotonic, and paymentMethod
>> is non-monotonic, so it can allow streaming aggregations on orderTime
>> and orderId, and disallow them on paymentMethod.
>>
>> This allows streaming engines to add novel mechanisms without having
>> to change the definition of streaming SQL.
>>
>>
>> On Tue, Feb 23, 2016 at 7:42 AM, Milinda Pathirage
>> <mp...@umail.iu.edu> wrote:
>> > Thank you Julian for the document.
>> >
>> > [1] is also a good read on punctuation. What I understood from reading
>> [1]
>> > and MillWheel paper is that a low-watermark (or row-time bound) is a
>> > property maintained by operators and operators derive low-watermark by
>> > processing punctuations.
>> >
>> > One other thing mentioned in MillWheel is the fact that Google's input
>> > streams contain punctuations to communicate stream progress. If
>> > punctuations are not there in the input stream we will have to generate
>> > them during ingest based on a slack or some similar technique. What do
>> you
>> > think about this?
>> >
>> > Thanks
>> > Milinda
>> >
>> >
>> > [1] http://www.vldb.org/pvldb/1/1453890.pdf
>> >
>> > On Mon, Feb 22, 2016 at 9:44 PM, Julian Hyde <jh...@apache.org> wrote:
>> >
>> >> I’ve updated the Streaming reference guide as Fabian requested:
>> >> http://calcite.apache.org/docs/stream.html
>> >>
>> >> Julian
>> >>
>> >> > On Feb 19, 2016, at 3:11 PM, Julian Hyde <jh...@apache.org> wrote:
>> >> >
>> >> > I gave a talk about streaming SQL at a Samza meetup. A lot of it is
>> >> about the semantics of streaming SQL, and I cover some ground that I
>> don’t
>> >> cover in the streams page[1].
>> >> >
>> >> > The news item[2] gets you to both slides and video.
>> >> >
>> >> > In other news, I notice[3] that Spark 2.1 will contain “continuous
>> SQL”.
>> >> If the examples[4] are accurate, all queries are heavily based on
>> sliding
>> >> windows, and they use a syntax for those windows that is very different
>> to
>> >> standard SQL.  I think we can deal with their use cases, and in my
>> opinion
>> >> our proposed syntax is more elegant and closer to the standard. But we
>> >> should discuss. I don’t want to diverge from other efforts because of
>> >> hubris/ignorance.
>> >> >
>> >> > At the Samza meetup some folks mentioned the use case of a stream that
>> >> summarizes, emitting periodic totals even if there were no data in a
>> given
>> >> period. Can they re-state that use case here, so we can discuss?
>> >> >
>> >> > Julian
>> >> >
>> >> > [1] http://calcite.apache.org/docs/stream.html
>> >> >
>> >> > [2] http://calcite.apache.org/news/2016/02/17/streaming-sql-talk/
>> >> >
>> >> > [3]
>> >>
>> http://www.slideshare.net/databricks/the-future-of-realtime-in-spark-58433411
>> >> slide 29
>> >> >
>> >> > [4]
>> >>
>> https://issues.apache.org/jira/secure/attachment/12775265/StreamingDataFrameProposal.pdf
>> >> >
>> >> >> On Feb 17, 2016, at 12:09 PM, Milinda Pathirage <
>> mpathira@umail.iu.edu>
>> >> wrote:
>> >> >>
>> >> >> Hi Fabian,
>> >> >>
>> >> >> We did some work on stream joins [1]. I tested stream-to-relation
>> joins
>> >> with Samza, but not stream-to-stream joins, and never updated the
>> >> streaming
>> >> >> documentation. I'll send a pull request with some documentation on
>> >> joins.
>> >> >>
>> >> >> Thanks
>> >> >> Milinda
>> >> >>
>> >> >> [1] https://issues.apache.org/jira/browse/CALCITE-968
>> >> >>
>> >> >> On Wed, Feb 17, 2016 at 5:09 AM, Fabian Hueske <fh...@gmail.com>
>> >> wrote:
>> >> >>
>> >> >>> Hi,
>> >> >>>
>> >> >>> I agree, the Streaming page is a very good starting point for this
>> >> >>> discussion. As suggested by Julian, I created CALCITE-1090 to update
>> >> the
>> >> >>> page such that it reflects the current state of the discussion
>> (adding
>> >> HOP
>> >> >>> and TUMBLE functions, punctuations). I can also help with that,
>> e.g.,
>> >> by
>> >> >>> contributing figures, examples, or text, reviewing, or any other
>> way.
>> >> >>>
>> >> >>> From my point of view, the semantics of the window types and the
>> other
>> >> >>> operators in the Streaming document are very good. What is missing
>> are
>> >> >>> joins (windowed stream-stream joins, stream-table joins,
>> stream-table
>> >> joins
>> >> >>> with table updates) as already noted in the todo section.
>> >> >>>
>> >> >>> Regarding the handling of late-arriving events, I am not sure if
>> this
>> >> is a
>> >> >>> purely QoS issue as the result of a query might depend on the chosen
>> >> >>> strategy. Also users might want to pick different strategies for
>> >> different
>> >> >>> operations, so late-arriver strategies are not necessarily
>> end-to-end
>> >> but
>> >> >>> can be operator specific. However, I think these details should be
>> >> >>> discussed in a separate thread.
>> >> >>>
>> >> >>> I'd like to add a few words about the StreamSQL roadmap of the Flink
>> >> >>> community.
>> >> >>> We are currently preparing our codebase and will start to work on
>> >> support
>> >> >>> for structured queries on streams in the next weeks. Flink will
>> >> support two
>> >> >>> query interfaces: a SQL interface and a LINQ-style Table API [1].
>> Both
>> >> will
>> >> >>> be optimized and translated by Calcite. As a first step, we want to
>> add
>> >> >>> simple stream transformations such as selection and projection to
>> both
>> >> >>> interfaces. Next, we will add windowing support (starting with
>> >> tumbling and
>> >> >>> hopping windows) to the Table API (as is said before, our plans here
>> >> are
>> >> >>> well aligned with Julian's suggestions). Once this is done, we would
>> >> extend
>> >> >>> the SQL interface to support windows which is hopefully as simple as
>> >> using
>> >> >>> a parser that accepts window syntax.
>> >> >>>
>> >> >>> So from our point of view, fixing the semantics of windows and
>> >> extending
>> >> >>> the optimizer accordingly is more urgent than agreeing on a syntax
>> >> >>> (although the Table API syntax could be inspired by Calcite's
>> StreamSQL
>> >> >>> syntax [2]). I can also help implementing the missing features in
>> >> Calcite.
>> >> >>>
>> >> >>> Having a reference implementation with tests would be awesome and
>> >> >>> definitely help.
>> >> >>>
>> >> >>> Best, Fabian
>> >> >>>
>> >> >>> [1]
>> >> >>>
>> >> >>>
>> >>
>> https://ci.apache.org/projects/flink/flink-docs-master/apis/batch/libs/table.html
>> >> >>> [2]
>> >> >>>
>> >> >>>
>> >>
>> https://docs.google.com/document/d/19kSOAOINKCSWLBCKRq2WvNtmuaA9o3AyCh2ePqr3V5E
>> >> >>>
>> >> >>> 2016-02-14 21:23 GMT+01:00 Julian Hyde <jh...@apache.org>:
>> >> >>>
>> >> >>>> Fabian,
>> >> >>>>
>> >> >>>> Apologies for the late reply.
>> >> >>>>
>> >> >>>> I would rather that the specification for streaming SQL was not too
>> >> >>>> prescriptive for how late events were handled. Approaches 1, 2 and
>> 3
>> >> are
>> >> >>>> all viable, and engines can differentiate themselves by the
>> strength
>> >> of
>> >> >>>> their support for this. But for the SQL to be considered valid, I
>> >> think
>> >> >>> the
>> >> >>>> validator just needs to know that it can make progress.
>> >> >>>>
>> >> >>>> There is a large area of functionality I’d call “quality of
>> service”
>> >> >>>> (QoS). This includes latency, reliability, at-least-once,
>> >> at-most-once or
>> >> >>>> in-order guarantees, as well as the late-row-handling this thread
>> is
>> >> >>>> concerned with. What the QoS metrics have in common is that they
>> are
>> >> >>>> end-to-end. To deliver a high QoS to the consumer, the producer
>> needs
>> >> to
>> >> >>>> conform to a high QoS. The QoS is beyond the control of the SQL
>> >> >>> statement.
>> >> >>>> (Although you can ask what a SQL statement is able to deliver,
>> given
>> >> the
>> >> >>>> upstream QoS guarantees.) QoS is best managed by the whole system,
>> >> and in
>> >> >>>> my opinion this is the biggest reason to have a DSMS.
>> >> >>>>
>> >> >>>> For this reason, I would be inclined to put QoS constraints on the
>> >> stream
>> >> >>>> definition, not on the query. For example, taking latency as the
>> QoS
>> >> >>> metric
>> >> >>>> of interest, you could flag the Orders stream as “at most 10 ms
>> >> latency
>> >> >>>> between the record’s timestamp and the wall-clock time of the
>> server
>> >> >>>> receiving the records, and any records arriving after that time are
>> >> >>> logged
>> >> >>>> and discarded”, and that QoS constraint applies to both producers
>> and
>> >> >>>> consumers.
>> >> >>>>
>> >> >>>> Given a query Q ‘select stream * from Orders’, it is valid to ask
>> >> “what
>> >> >>> is
>> >> >>>> the expected latency of Q?” or tell the planner “produce an
>> >> >>> implementation
>> >> >>>> of Q with a latency of no more than 15 ms, and if you cannot
>> achieve
>> >> that
>> >> >>>> latency, fail”. You could even register Q in the system and tell
>> the
>> >> >>> system
>> >> >>>> to tighten up the latency of any upstream streams and the standing
>> >> >>> queries
>> >> >>>> that populate them. But it’s not valid to say “execute Q with a
>> >> latency
>> >> >>> of
>> >> >>>> 15 ms”: the system may not be able to achieve it.
>> >> >>>>
>> >> >>>> In summary: I would allow latency and late-row-handling and other
>> QoS
>> >> >>>> annotations in the query but it’s not the most natural or powerful
>> >> place
>> >> >>> to
>> >> >>>> put them.
>> >> >>>>
>> >> >>>> Julian
>> >> >>>>
>> >> >>>>
>> >> >>>>> On Feb 6, 2016, at 1:28 AM, Fabian Hueske <fh...@gmail.com>
>> wrote:
>> >> >>>>>
>> >> >>>>> Excellent! I missed the punctuations in the todo list.
>> >> >>>>>
>> >> >>>>> What kind of strategies do you have in mind to handle events that
>> >> >>> arrive
>> >> >>>>> too late? I see
>> >> >>>>> 1. dropping of late events
>> >> >>>>> 2. computing an updated window result for each late arriving
>> >> >>>>> element (implies that the window state is stored for a certain
>> period
>> >> >>>>> before it is discarded)
>> >> >>>>> 3. computing a delta to the previous window result for each late
>> >> >>> arriving
>> >> >>>>> element (requires window state as well, not applicable to all
>> >> >>> aggregation
>> >> >>>>> types)
>> >> >>>>>
>> >> >>>>> It would be nice if strategies to handle late-arrivers could be
>> >> defined
>> >> >>>> in
>> >> >>>>> the query.
>> >> >>>>>
>> >> >>>>> I think the plans of the Flink community are quite well aligned
>> with
>> >> >>> your
>> >> >>>>> ideas for SQL on Streams.
>> >> >>>>> Should we start by updating / extending the Stream document on the
>> >> >>>> Calcite
>> >> >>>>> website to include the new window definitions (TUMBLE, HOP) and a
>> >> >>>>> discussion of punctuations/watermarks/time bounds?
>> >> >>>>>
>> >> >>>>> Fabian
>> >> >>>>>
>> >> >>>>>
>> >> >>>>>
>> >> >>>>>
>> >> >>>>>
>> >> >>>>>
>> >> >>>>> 2016-02-06 2:35 GMT+01:00 Julian Hyde <jh...@apache.org>:
>> >> >>>>>
>> >> >>>>>> Let me rephrase: The *majority* of the literature, of which I
>> cited
>> >> >>>>>> just one example, calls them punctuation, and a couple of recent
>> >> >>>>>> papers out of Mountain View don't change that.
>> >> >>>>>>
>> >> >>>>>> There are some fine distinctions between punctuation, heartbeats,
>> >> >>>>>> watermarks and rowtime bounds, mostly in terms of how they are
>> >> >>>>>> generated and propagated, that matter little when planning the
>> >> query.
>> >> >>>>>>
>> >> >>>>>> On Fri, Feb 5, 2016 at 5:18 PM, Ted Dunning <
>> ted.dunning@gmail.com>
>> >> >>>> wrote:
>> >> >>>>>>> On Fri, Feb 5, 2016 at 5:10 PM, Julian Hyde <jh...@apache.org>
>> >> >>> wrote:
>> >> >>>>>>>
>> >> >>>>>>>> Yes, watermarks, absolutely. The "to do" list has
>> "punctuation",
>> >> >>> which
>> >> >>>>>>>> is the same thing. (Actually, I prefer to call it "rowtime
>> bound"
>> >> >>>>>>>> because it feels more like a dynamic constraint than a
>> piece of
>> >> >>>>>>>> data, but the literature[1] calls them punctuation.)
>> >> >>>>>>>>
>> >> >>>>>>>
>> >> >>>>>>> Some of the literature calls them punctuation, other literature
>> [1]
>> >> >>>> calls
>> >> >>>>>>> them watermarks.
>> >> >>>>>>>
>> >> >>>>>>> [1] http://www.vldb.org/pvldb/vol8/p1792-Akidau.pdf
>> >> >>>>>>
>> >> >>>>
>> >> >>>>
>> >> >>>
>> >> >>
>> >> >>
>> >> >>
>> >> >> --
>> >> >> Milinda Pathirage
>> >> >>
>> >> >> PhD Student | Research Assistant
>> >> >> School of Informatics and Computing | Data to Insight Center
>> >> >> Indiana University
>> >> >>
>> >> >> twitter: milindalakmal
>> >> >> skype: milinda.pathirage
>> >> >> blog: http://milinda.pathirage.org
>> >> >
>> >>
>> >>
>> >
>> >
>> > --
>> > Milinda Pathirage
>> >
>> > PhD Student | Research Assistant
>> > School of Informatics and Computing | Data to Insight Center
>> > Indiana University
>> >
>> > twitter: milindalakmal
>> > skype: milinda.pathirage
>> > blog: http://milinda.pathirage.org
>>
>
>
>
> --
> Milinda Pathirage
>
> PhD Student | Research Assistant
> School of Informatics and Computing | Data to Insight Center
> Indiana University
>
> twitter: milindalakmal
> skype: milinda.pathirage
> blog: http://milinda.pathirage.org

Re: About Stream SQL

Posted by Milinda Pathirage <mp...@umail.iu.edu>.
Hi Julian,

I agree with you. Calcite should stay away from physical properties of
stream as much as possible. I was just trying to clarify the confusion
regarding the punctuations and watermarks. My last question was not related
to Calcite, rather to Flink and other implementations. Sorry for the
confusion.

Milinda

On Tue, Feb 23, 2016 at 11:41 AM, Julian Hyde <jh...@apache.org> wrote:

> As the author of the streaming SQL specification, I don't care at all
> how the system deduces that it is able to make progress. Just as the
> authors of the SQL standard don't care whether a vendor chooses to
> store records sorted and/or compressed.
>
> All the streaming SQL validator/optimizer needs to know is that, say,
> orderTime is monotonic, orderId is quasi-monotonic, and paymentMethod
> is non-monotonic, so it can allow streaming aggregations on orderTime
> and orderId, and disallow them on paymentMethod.
>
> This allows streaming engines to add novel mechanisms without having
> to change the definition of streaming SQL.
>
>
> On Tue, Feb 23, 2016 at 7:42 AM, Milinda Pathirage
> <mp...@umail.iu.edu> wrote:
> > Thank you Julian for the document.
> >
> > [1] is also a good read on punctuation. What I understood from reading
> [1]
> > and MillWheel paper is that a low-watermark (or row-time bound) is a
> > property maintained by operators and operators derive low-watermark by
> > processing punctuations.
> >
> > One other thing mentioned in MillWheel is the fact that Google's input
> > streams contain punctuations to communicate stream progress. If
> > punctuations are not there in the input stream we will have to generate
> > them during ingest based on a slack or some similar technique. What do
> you
> > think about this?
> >
> > Thanks
> > Milinda
> >
> >
> > [1] http://www.vldb.org/pvldb/1/1453890.pdf
> >
> > On Mon, Feb 22, 2016 at 9:44 PM, Julian Hyde <jh...@apache.org> wrote:
> >
> >> I’ve updated the Streaming reference guide as Fabian requested:
> >> http://calcite.apache.org/docs/stream.html <
> >> http://calcite.apache.org/docs/stream.html>
> >>
> >> Julian
> >>
> >> > On Feb 19, 2016, at 3:11 PM, Julian Hyde <jh...@apache.org> wrote:
> >> >
> >> > I gave a talk about streaming SQL at a Samza meetup. A lot of it is
> >> about the semantics of streaming SQL, and I cover some ground that I
> don’t
> >> cover in the streams page[1].
> >> >
> >> > The news item[2] gets you to both slides and video.
> >> >
> >> > In other news, I notice[3] that Spark 2.1 will contain “continuous
> SQL”.
> >> If the examples[4] are accurate, all queries are heavily based on
> sliding
> >> windows, and they use a syntax for those windows that is very different
> to
> >> standard SQL.  I think we can deal with their use cases, and in my
> opinion
> >> our proposed syntax is more elegant and closer to the standard. But we
> >> should discuss. I don’t want to diverge from other efforts because of
> >> hubris/ignorance.
> >> >
> >> > At the Samza meetup some folks mentioned the use case of a stream that
> >> summarizes, emitting periodic totals even if there were no data in a
> given
> >> period. Can they re-state that use case here, so we can discuss?
> >> >
> >> > Julian
> >> >
> >> > [1] http://calcite.apache.org/docs/stream.html
> >> >
> >> > [2] http://calcite.apache.org/news/2016/02/17/streaming-sql-talk/
> >> >
> >> > [3]
> >>
> http://www.slideshare.net/databricks/the-future-of-realtime-in-spark-58433411
> >> slide 29
> >> >
> >> > [4]
> >>
> https://issues.apache.org/jira/secure/attachment/12775265/StreamingDataFrameProposal.pdf
> >> >
> >> >> On Feb 17, 2016, at 12:09 PM, Milinda Pathirage <
> mpathira@umail.iu.edu>
> >> wrote:
> >> >>
> >> >> Hi Fabian,
> >> >>
> >> >> We did some work on stream joins [1]. I tested stream-to-relation
> joins
> >> >> with Samza. But not stream-to-stream joins. But never updated the
> >> streaming
> >> >> documentation. I'll send a pull request with some documentation on
> >> joins.
> >> >>
> >> >> Thanks
> >> >> Milinda
> >> >>
> >> >> [1] https://issues.apache.org/jira/browse/CALCITE-968
> >> >>
> >> >> On Wed, Feb 17, 2016 at 5:09 AM, Fabian Hueske <fh...@gmail.com>
> >> wrote:
> >> >>
> >> >>> Hi,
> >> >>>
> >> >>> I agree, the Streaming page is a very good starting point for this
> >> >>> discussion. As suggested by Julian, I created CALCITE-1090 to update
> >> the
> >> >>> page such that it reflects the current state of the discussion
> (adding
> >> HOP
> >> >>> and TUMBLE functions, punctuations). I can also help with that,
> e.g.,
> >> by
> >> >>> contributing figures, examples, or text, reviewing, or any other
> way.
> >> >>>
> >> >>> From my point of view, the semantics of the window types and the
> other
> >> >>> operators in the Streaming document are very good. What is missing
> are
> >> >>> joins (windowed stream-stream joins, stream-table joins,
> stream-table
> >> joins
> >> >>> with table updates) as already noted in the todo section.
> >> >>>
> >> >>> Regarding the handling of late-arriving events, I am not sure if
> this
> >> is a
> >> >>> purely QoS issue as the result of a query might depend on the chosen
> >> >>> strategy. Also users might want to pick different strategies for
> >> different
> >> >>> operations, so late-arriver strategies are not necessarily
> end-to-end
> >> but
> >> >>> can be operator specific. However, I think these details should be
> >> >>> discussed in a separate thread.
> >> >>>
> >> >>> I'd like to add a few words about the StreamSQL roadmap of the Flink
> >> >>> community.
> >> >>> We are currently preparing our codebase and will start to work on
> >> support
> >> >>> for structured queries on streams in the next weeks. Flink will
> >> support two
> >> >>> query interface, a SQL interface and a LINQ-style Table API [1].
> Both
> >> will
> >> >>> be optimized and translated by Calcite. As a first step, we want to
> add
> >> >>> simple stream transformations such as selection and projection to
> both
> >> >>> interfaces. Next, we will add windowing support (starting with
> >> tumbling and
> >> >>> hopping windows) to the Table API (as is said before, our plans here
> >> are
> >> >>> well aligned with Julian's suggestions). Once this is done, we would
> >> extend
> >> >>> the SQL interface to support windows which is hopefully as simple as
> >> using
> >> >>> a parser that accepts window syntax.
> >> >>>
> >> >>> So from our point of view, fixing the semantics of windows and
> >> extending
> >> >>> the optimizer accordingly is more urgent than agreeing on a syntax
> >> >>> (although the Table API syntax could be inspired by Calcite's
> StreamSQL
> >> >>> syntax [2]). I can also help implementing the missing features in
> >> Calcite.
> >> >>>
> >> >>> Having a reference implementation with tests would be awesome and
> >> >>> definitely help.
> >> >>>
> >> >>> Best, Fabian
> >> >>>
> >> >>> [1]
> >> >>>
> >> >>>
> >>
> https://ci.apache.org/projects/flink/flink-docs-master/apis/batch/libs/table.html
> >> >>> [2]
> >> >>>
> >> >>>
> >>
> https://docs.google.com/document/d/19kSOAOINKCSWLBCKRq2WvNtmuaA9o3AyCh2ePqr3V5E
> >> >>>
> >> >>> 2016-02-14 21:23 GMT+01:00 Julian Hyde <jh...@apache.org>:
> >> >>>
> >> >>>> Fabian,
> >> >>>>
> >> >>>> Apologies for the late reply.
> >> >>>>
> >> >>>> I would rather that the specification for streaming SQL was not too
> >> >>>> prescriptive for how late events were handled. Approaches 1, 2 and
> 3
> >> are
> >> >>>> all viable, and engines can differentiate themselves by the
> strength
> >> of
> >> >>>> their support for this. But for the SQL to be considered valid, I
> >> think
> >> >>> the
> >> >>>> validator just needs to know that it can make progress.
> >> >>>>
> >> >>>> There is a large area of functionality I’d call “quality of
> service”
> >> >>>> (QoS). This includes latency, reliability, at-least-once,
> >> at-most-once or
> >> >>>> in-order guarantees, as well as the late-row-handling this thread
> is
> >> >>>> concerned with. What the QoS metrics have in common is that they
> are
> >> >>>> end-to-end. To deliver a high QoS to the consumer, the producer
> needs
> >> to
> >> >>>> conform to a high QoS. The QoS is beyond the control of the SQL
> >> >>> statement.
> >> >>>> (Although you can ask what a SQL statement is able to deliver,
> given
> >> the
> >> >>>> upstream QoS guarantees.) QoS is best managed by the whole system,
> >> and in
> >> >>>> my opinion this is the biggest reason to have a DSMS.
> >> >>>>
> >> >>>> For this reason, I would be inclined to put QoS constraints on the
> >> stream
> >> >>>> definition, not on the query. For example, taking latency as the
> QoS
> >> >>> metric
> >> >>>> of interest, you could flag the Orders stream as “at most 10 ms
> >> latency
> >> >>>> between the record’s timestamp and the wall-clock time of the
> server
> >> >>>> receiving the records, and any records arriving after that time are
> >> >>> logged
> >> >>>> and discarded”, and that QoS constraint applies to both producers
> and
> >> >>>> consumers.
> >> >>>>
> >> >>>> Given a query Q ‘select stream * from Orders’, it is valid to ask
> >> “what
> >> >>> is
> >> >>>> the expected latency of Q?” or tell the planner “produce an
> >> >>> implementation
> >> >>>> of Q with a latency of no more than 15 ms, and if you cannot
> achieve
> >> that
> >> >>>> latency, fail”. You could even register Q in the system and tell
> the
> >> >>> system
> >> >>>> to tighten up the latency of any upstream streams and the standing
> >> >>> queries
> >> >>>> that populate them. But it’s not valid to say “execute Q with a
> >> latency
> >> >>> of
> >> >>>> 15 ms”: the system may not be able to achieve it.
> >> >>>>
> >> >>>> In summary: I would allow latency and late-row-handling and other
> QoS
> >> >>>> annotations in the query but it’s not the most natural or powerful
> >> place
> >> >>> to
> >> >>>> put them.
> >> >>>>
> >> >>>> Julian
> >> >>>>
> >> >>>>
> >> >>>>> On Feb 6, 2016, at 1:28 AM, Fabian Hueske <fh...@gmail.com>
> wrote:
> >> >>>>>
> >> >>>>> Excellent! I missed the punctuations in the todo list.
> >> >>>>>
> >> >>>>> What kind of strategies do you have in mind to handle events that
> >> >>> arrive
> >> >>>>> too late? I see
> >> >>>>> 1. dropping of late events
> >> >>>>> 2. computing an updated window result for each late arriving
> >> >>>>> element (implies that the window state is stored for a certain
> period
> >> >>>>> before it is discarded)
> >> >>>>> 3. computing a delta to the previous window result for each late
> >> >>> arriving
> >> >>>>> element (requires window state as well, not applicable to all
> >> >>> aggregation
> >> >>>>> types)
> >> >>>>>
> >> >>>>> It would be nice if strategies to handle late-arrivers could be
> >> defined
> >> >>>> in
> >> >>>>> the query.
> >> >>>>>
> >> >>>>> I think the plans of the Flink community are quite well aligned
> with
> >> >>> your
> >> >>>>> ideas for SQL on Streams.
> >> >>>>> Should we start by updating / extending the Stream document on the
> >> >>>> Calcite
> >> >>>>> website to include the new window definitions (TUMBLE, HOP) and a
> >> >>>>> discussion of punctuations/watermarks/time bounds?
> >> >>>>>
> >> >>>>> Fabian
> >> >>>>>
> >> >>>>>
> >> >>>>>
> >> >>>>>
> >> >>>>>
> >> >>>>>
> >> >>>>> 2016-02-06 2:35 GMT+01:00 Julian Hyde <jh...@apache.org>:
> >> >>>>>
> >> >>>>>> Let me rephrase: The *majority* of the literature, of which I
> cited
> >> >>>>>> just one example, calls them punctuation, and a couple of recent
> >> >>>>>> papers out of Mountain View doesn't change that.
> >> >>>>>>
> >> >>>>>> There are some fine distinctions between punctuation, heartbeats,
> >> >>>>>> watermarks and rowtime bounds, mostly in terms of how they are
> >> >>>>>> generated and propagated, that matter little when planning the
> >> query.
> >> >>>>>>
> >> >>>>>> On Fri, Feb 5, 2016 at 5:18 PM, Ted Dunning <
> ted.dunning@gmail.com>
> >> >>>> wrote:
> >> >>>>>>> On Fri, Feb 5, 2016 at 5:10 PM, Julian Hyde <jh...@apache.org>
> >> >>> wrote:
> >> >>>>>>>
> >> >>>>>>>> Yes, watermarks, absolutely. The "to do" list has
> "punctuation",
> >> >>> which
> >> >>>>>>>> is the same thing. (Actually, I prefer to call it "rowtime
> bound"
> >> >>>>>>>> because it is feels more like a dynamic constraint than a
> piece of
> >> >>>>>>>> data, but the literature[1] calls them punctuation.)
> >> >>>>>>>>
> >> >>>>>>>
> >> >>>>>>> Some of the literature calls them punctuation, other literature
> [1]
> >> >>>> calls
> >> >>>>>>> them watermarks.
> >> >>>>>>>
> >> >>>>>>> [1] http://www.vldb.org/pvldb/vol8/p1792-Akidau.pdf
> >> >>>>>>
> >> >>>>
> >> >>>>
> >> >>>
> >> >>
> >> >>
> >> >>
> >> >> --
> >> >> Milinda Pathirage
> >> >>
> >> >> PhD Student | Research Assistant
> >> >> School of Informatics and Computing | Data to Insight Center
> >> >> Indiana University
> >> >>
> >> >> twitter: milindalakmal
> >> >> skype: milinda.pathirage
> >> >> blog: http://milinda.pathirage.org
> >> >
> >>
> >>
> >
> >
> > --
> > Milinda Pathirage
> >
> > PhD Student | Research Assistant
> > School of Informatics and Computing | Data to Insight Center
> > Indiana University
> >
> > twitter: milindalakmal
> > skype: milinda.pathirage
> > blog: http://milinda.pathirage.org
>



-- 
Milinda Pathirage

PhD Student | Research Assistant
School of Informatics and Computing | Data to Insight Center
Indiana University

twitter: milindalakmal
skype: milinda.pathirage
blog: http://milinda.pathirage.org

Re: About Stream SQL

Posted by Julian Hyde <jh...@apache.org>.
As the author of the streaming SQL specification, I don't care at all
how the system deduces that it is able to make progress, just as the
authors of the SQL standard don't care whether a vendor chooses to
store records sorted and/or compressed.

All the streaming SQL validator/optimizer needs to know is that, say,
orderTime is monotonic, orderId is quasi-monotonic, and paymentMethod
is non-monotonic, so it can allow streaming aggregations on orderTime
and orderId, and disallow them on paymentMethod.

This allows streaming engines to add novel mechanisms without having
to change the definition of streaming SQL.
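
For instance (the Orders schema and column names here are illustrative,
in the spirit of the examples on the Streaming page), the validator
could accept a tumbling-window aggregation over the monotonic orderTime
column:

  SELECT STREAM FLOOR(orderTime TO HOUR) AS orderHour,
         COUNT(*) AS c
  FROM Orders
  GROUP BY FLOOR(orderTime TO HOUR);

but reject the same aggregation grouped only on the non-monotonic
paymentMethod column, because no amount of further input would ever let
it emit a completed group:

  -- invalid: paymentMethod is non-monotonic, so this query
  -- could never make progress
  SELECT STREAM paymentMethod, COUNT(*) AS c
  FROM Orders
  GROUP BY paymentMethod;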


On Tue, Feb 23, 2016 at 7:42 AM, Milinda Pathirage
<mp...@umail.iu.edu> wrote:
> Thank you Julian for the document.
>
> [1] is also a good read on punctuation. What I understood from reading [1]
> and MillWheel paper is that a low-watermark (or row-time bound) is a
> property maintained by operators and operators derive low-watermark by
> processing punctuations.
>
> One other thing mentioned in MillWheel is the fact that Google's input
> streams contain punctuations to communicate stream progress. If
> punctuations are not there in the input stream we will have to generate
> them during ingest based on a slack or some similar technique. What do you
> think about this?
>
> Thanks
> Milinda
>
>
> [1] http://www.vldb.org/pvldb/1/1453890.pdf
>
> On Mon, Feb 22, 2016 at 9:44 PM, Julian Hyde <jh...@apache.org> wrote:
>
>> I’ve updated the Streaming reference guide as Fabian requested:
>> http://calcite.apache.org/docs/stream.html
>>
>> Julian
>>
>> > On Feb 19, 2016, at 3:11 PM, Julian Hyde <jh...@apache.org> wrote:
>> >
>> > I gave a talk about streaming SQL at a Samza meetup. A lot of it is
>> about the semantics of streaming SQL, and I cover some ground that I don’t
>> cover in the streams page[1].
>> >
>> > The news item[2] gets you to both slides and video.
>> >
>> > In other news, I notice[3] that Spark 2.1 will contain “continuous SQL”.
>> If the examples[4] are accurate, all queries are heavily based on sliding
>> windows, and they use a syntax for those windows that is very different to
>> standard SQL.  I think we can deal with their use cases, and in my opinion
>> our proposed syntax is more elegant and closer to the standard. But we
>> should discuss. I don’t want to diverge from other efforts because of
>> hubris/ignorance.
>> >
>> > At the Samza meetup some folks mentioned the use case of a stream that
>> summarizes, emitting periodic totals even if there were no data in a given
>> period. Can they re-state that use case here, so we can discuss?
>> >
>> > Julian
>> >
>> > [1] http://calcite.apache.org/docs/stream.html
>> >
>> > [2] http://calcite.apache.org/news/2016/02/17/streaming-sql-talk/
>> >
>> > [3]
>> http://www.slideshare.net/databricks/the-future-of-realtime-in-spark-58433411
>> slide 29
>> >
>> > [4]
>> https://issues.apache.org/jira/secure/attachment/12775265/StreamingDataFrameProposal.pdf
>> >
>> >> On Feb 17, 2016, at 12:09 PM, Milinda Pathirage <mp...@umail.iu.edu>
>> wrote:
>> >>
>> >> Hi Fabian,
>> >>
>> >> We did some work on stream joins [1]. I tested stream-to-relation joins
>> >> with Samza, but not stream-to-stream joins. We never updated the
>> streaming
>> >> documentation. I'll send a pull request with some documentation on
>> joins.
>> >>
>> >> Thanks
>> >> Milinda
>> >>
>> >> [1] https://issues.apache.org/jira/browse/CALCITE-968
>> >>
>> >> On Wed, Feb 17, 2016 at 5:09 AM, Fabian Hueske <fh...@gmail.com>
>> wrote:
>> >>
>> >>> Hi,
>> >>>
>> >>> I agree, the Streaming page is a very good starting point for this
>> >>> discussion. As suggested by Julian, I created CALCITE-1090 to update
>> the
>> >>> page such that it reflects the current state of the discussion (adding
>> HOP
>> >>> and TUMBLE functions, punctuations). I can also help with that, e.g.,
>> by
>> >>> contributing figures, examples, or text, reviewing, or any other way.
>> >>>
>> >>> From my point of view, the semantics of the window types and the other
>> >>> operators in the Streaming document are very good. What is missing are
>> >>> joins (windowed stream-stream joins, stream-table joins, stream-table
>> joins
>> >>> with table updates) as already noted in the todo section.
>> >>>
>> >>> Regarding the handling of late-arriving events, I am not sure if this
>> is a
>> >>> purely QoS issue as the result of a query might depend on the chosen
>> >>> strategy. Also users might want to pick different strategies for
>> different
>> >>> operations, so late-arriver strategies are not necessarily end-to-end
>> but
>> >>> can be operator specific. However, I think these details should be
>> >>> discussed in a separate thread.
>> >>>
>> >>> I'd like to add a few words about the StreamSQL roadmap of the Flink
>> >>> community.
>> >>> We are currently preparing our codebase and will start to work on
>> support
>> >>> for structured queries on streams in the next weeks. Flink will
>> support two
>> >>> query interfaces, a SQL interface and a LINQ-style Table API [1]. Both
>> will
>> >>> be optimized and translated by Calcite. As a first step, we want to add
>> >>> simple stream transformations such as selection and projection to both
>> >>> interfaces. Next, we will add windowing support (starting with
>> tumbling and
>> >>> hopping windows) to the Table API (as I said before, our plans here
>> are
>> >>> well aligned with Julian's suggestions). Once this is done, we would
>> extend
>> >>> the SQL interface to support windows which is hopefully as simple as
>> using
>> >>> a parser that accepts window syntax.
>> >>>
>> >>> So from our point of view, fixing the semantics of windows and
>> extending
>> >>> the optimizer accordingly is more urgent than agreeing on a syntax
>> >>> (although the Table API syntax could be inspired by Calcite's StreamSQL
>> >>> syntax [2]). I can also help implementing the missing features in
>> Calcite.
>> >>>
>> >>> Having a reference implementation with tests would be awesome and
>> >>> definitely help.
>> >>>
>> >>> Best, Fabian
>> >>>
>> >>> [1]
>> >>>
>> >>>
>> https://ci.apache.org/projects/flink/flink-docs-master/apis/batch/libs/table.html
>> >>> [2]
>> >>>
>> >>>
>> https://docs.google.com/document/d/19kSOAOINKCSWLBCKRq2WvNtmuaA9o3AyCh2ePqr3V5E
>> >>>
>> >>> 2016-02-14 21:23 GMT+01:00 Julian Hyde <jh...@apache.org>:
>> >>>
>> >>>> Fabian,
>> >>>>
>> >>>> Apologies for the late reply.
>> >>>>
>> >>>> I would rather that the specification for streaming SQL was not too
>> >>>> prescriptive for how late events were handled. Approaches 1, 2 and 3
>> are
>> >>>> all viable, and engines can differentiate themselves by the strength
>> of
>> >>>> their support for this. But for the SQL to be considered valid, I
>> think
>> >>> the
>> >>>> validator just needs to know that it can make progress.
>> >>>>
>> >>>> There is a large area of functionality I’d call “quality of service”
>> >>>> (QoS). This includes latency, reliability, at-least-once,
>> at-most-once or
>> >>>> in-order guarantees, as well as the late-row-handling this thread is
>> >>>> concerned with. What the QoS metrics have in common is that they are
>> >>>> end-to-end. To deliver a high QoS to the consumer, the producer needs
>> to
>> >>>> conform to a high QoS. The QoS is beyond the control of the SQL
>> >>> statement.
>> >>>> (Although you can ask what a SQL statement is able to deliver, given
>> the
>> >>>> upstream QoS guarantees.) QoS is best managed by the whole system,
>> and in
>> >>>> my opinion this is the biggest reason to have a DSMS.
>> >>>>
>> >>>> For this reason, I would be inclined to put QoS constraints on the
>> stream
>> >>>> definition, not on the query. For example, taking latency as the QoS
>> >>> metric
>> >>>> of interest, you could flag the Orders stream as “at most 10 ms
>> latency
>> >>>> between the record’s timestamp and the wall-clock time of the server
>> >>>> receiving the records, and any records arriving after that time are
>> >>> logged
>> >>>> and discarded”, and that QoS constraint applies to both producers and
>> >>>> consumers.
>> >>>>
>> >>>> Given a query Q ‘select stream * from Orders’, it is valid to ask
>> “what
>> >>> is
>> >>>> the expected latency of Q?” or tell the planner “produce an
>> >>> implementation
>> >>>> of Q with a latency of no more than 15 ms, and if you cannot achieve
>> that
>> >>>> latency, fail”. You could even register Q in the system and tell the
>> >>> system
>> >>>> to tighten up the latency of any upstream streams and the standing
>> >>> queries
>> >>>> that populate them. But it’s not valid to say “execute Q with a
>> latency
>> >>> of
>> >>>> 15 ms”: the system may not be able to achieve it.
>> >>>>
>> >>>> In summary: I would allow latency and late-row-handling and other QoS
>> >>>> annotations in the query but it’s not the most natural or powerful
>> place
>> >>> to
>> >>>> put them.
>> >>>>
>> >>>> Julian
>> >>>>
>> >>>>
>> >>>>> On Feb 6, 2016, at 1:28 AM, Fabian Hueske <fh...@gmail.com> wrote:
>> >>>>>
>> >>>>> Excellent! I missed the punctuations in the todo list.
>> >>>>>
>> >>>>> What kind of strategies do you have in mind to handle events that
>> >>> arrive
>> >>>>> too late? I see
>> >>>>> 1. dropping of late events
>> >>>>> 2. computing an updated window result for each late arriving
>> >>>>> element (implies that the window state is stored for a certain period
>> >>>>> before it is discarded)
>> >>>>> 3. computing a delta to the previous window result for each late
>> >>> arriving
>> >>>>> element (requires window state as well, not applicable to all
>> >>> aggregation
>> >>>>> types)
>> >>>>>
>> >>>>> It would be nice if strategies to handle late-arrivers could be
>> defined
>> >>>> in
>> >>>>> the query.
>> >>>>>
>> >>>>> I think the plans of the Flink community are quite well aligned with
>> >>> your
>> >>>>> ideas for SQL on Streams.
>> >>>>> Should we start by updating / extending the Stream document on the
>> >>>> Calcite
>> >>>>> website to include the new window definitions (TUMBLE, HOP) and a
>> >>>>> discussion of punctuations/watermarks/time bounds?
>> >>>>>
>> >>>>> Fabian
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>> 2016-02-06 2:35 GMT+01:00 Julian Hyde <jh...@apache.org>:
>> >>>>>
>> >>>>>> Let me rephrase: The *majority* of the literature, of which I cited
>> >>>>>> just one example, calls them punctuation, and a couple of recent
>> >>>>>> papers out of Mountain View doesn't change that.
>> >>>>>>
>> >>>>>> There are some fine distinctions between punctuation, heartbeats,
>> >>>>>> watermarks and rowtime bounds, mostly in terms of how they are
>> >>>>>> generated and propagated, that matter little when planning the
>> query.
>> >>>>>>
>> >>>>>> On Fri, Feb 5, 2016 at 5:18 PM, Ted Dunning <te...@gmail.com>
>> >>>> wrote:
>> >>>>>>> On Fri, Feb 5, 2016 at 5:10 PM, Julian Hyde <jh...@apache.org>
>> >>> wrote:
>> >>>>>>>
>> >>>>>>>> Yes, watermarks, absolutely. The "to do" list has "punctuation",
>> >>> which
>> >>>>>>>> is the same thing. (Actually, I prefer to call it "rowtime bound"
>> >>>>>>>> because it feels more like a dynamic constraint than a piece of
>> >>>>>>>> data, but the literature[1] calls them punctuation.)
>> >>>>>>>>
>> >>>>>>>
>> >>>>>>> Some of the literature calls them punctuation, other literature [1]
>> >>>> calls
>> >>>>>>> them watermarks.
>> >>>>>>>
>> >>>>>>> [1] http://www.vldb.org/pvldb/vol8/p1792-Akidau.pdf

Re: About Stream SQL

Posted by Milinda Pathirage <mp...@umail.iu.edu>.
Thank you Julian for the document.

[1] is also a good read on punctuation. What I understood from reading [1]
and the MillWheel paper is that a low-watermark (or row-time bound) is a
property maintained by operators, and that operators derive the
low-watermark by processing punctuations.

One other thing mentioned in MillWheel is that Google's input
streams contain punctuations to communicate stream progress. If
punctuations are not present in the input stream, we will have to
generate them during ingestion based on a slack interval or some
similar technique. What do you think about this?

Thanks
Milinda


[1] http://www.vldb.org/pvldb/1/1453890.pdf

On Mon, Feb 22, 2016 at 9:44 PM, Julian Hyde <jh...@apache.org> wrote:

> I’ve updated the Streaming reference guide as Fabian requested:
> http://calcite.apache.org/docs/stream.html
>
> Julian
>


-- 
Milinda Pathirage

PhD Student | Research Assistant
School of Informatics and Computing | Data to Insight Center
Indiana University

twitter: milindalakmal
skype: milinda.pathirage
blog: http://milinda.pathirage.org

Re: About Stream SQL

Posted by Julian Hyde <jh...@apache.org>.
I’ve updated the Streaming reference guide as Fabian requested: http://calcite.apache.org/docs/stream.html

Julian

> On Feb 19, 2016, at 3:11 PM, Julian Hyde <jh...@apache.org> wrote:
> 
> I gave a talk about streaming SQL at a Samza meetup. A lot of it is about the semantics of streaming SQL, and I cover some ground that I don’t cover in the streams page[1].
> 
> The news item[2] gets you to both slides and video.
> 
> In other news, I notice[3] that Spark 2.1 will contain “continuous SQL”. If the examples[4] are accurate, all queries are heavily based on sliding windows, and they use a syntax for those windows that is very different to standard SQL.  I think we can deal with their use cases, and in my opinion our proposed syntax is more elegant and closer to the standard. But we should discuss. I don’t want to diverge from other efforts because of hubris/ignorance.
> 
> At the Samza meetup some folks mentioned the use case of a stream that summarizes, emitting periodic totals even if there were no data in a given period. Can they re-state that use case here, so we can discuss?
> 
> Julian
> 
> [1] http://calcite.apache.org/docs/stream.html
> 
> [2] http://calcite.apache.org/news/2016/02/17/streaming-sql-talk/
> 
> [3] http://www.slideshare.net/databricks/the-future-of-realtime-in-spark-58433411 slide 29
> 
> [4] https://issues.apache.org/jira/secure/attachment/12775265/StreamingDataFrameProposal.pdf 
> 


Re: About Stream SQL

Posted by Julian Hyde <jh...@apache.org>.
I gave a talk about streaming SQL at a Samza meetup. A lot of it is about the semantics of streaming SQL, and I cover some ground that I don’t cover in the streams page[1].

The news item[2] gets you to both slides and video.

In other news, I notice[3] that Spark 2.1 will contain “continuous SQL”. If the examples[4] are accurate, all queries are heavily based on sliding windows, and they use a syntax for those windows that is very different to standard SQL.  I think we can deal with their use cases, and in my opinion our proposed syntax is more elegant and closer to the standard. But we should discuss. I don’t want to diverge from other efforts because of hubris/ignorance.
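
For comparison, here is roughly what a tumbling-window aggregate looks like in the syntax proposed on the streams page; the Orders stream and its columns are illustrative names, not a real schema:

```sql
-- Hourly order counts per product in the proposed streaming-SQL syntax:
-- TUMBLE is a grouped window function applied to the rowtime column.
SELECT STREAM
  TUMBLE_END(rowtime, INTERVAL '1' HOUR) AS rowtime,
  productId,
  COUNT(*) AS c
FROM Orders
GROUP BY TUMBLE(rowtime, INTERVAL '1' HOUR), productId;
```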

At the Samza meetup some folks mentioned the use case of a stream that summarizes, emitting periodic totals even if there were no data in a given period. Can they re-state that use case here, so we can discuss?

Julian

[1] http://calcite.apache.org/docs/stream.html

[2] http://calcite.apache.org/news/2016/02/17/streaming-sql-talk/

[3] http://www.slideshare.net/databricks/the-future-of-realtime-in-spark-58433411 slide 29

[4] https://issues.apache.org/jira/secure/attachment/12775265/StreamingDataFrameProposal.pdf 


Re: About Stream SQL

Posted by Milinda Pathirage <mp...@umail.iu.edu>.
Hi Fabian,

We did some work on stream joins [1]. I tested stream-to-relation joins
with Samza, but not stream-to-stream joins, and I never updated the streaming
documentation. I'll send a pull request with some documentation on joins.

Thanks
Milinda

[1] https://issues.apache.org/jira/browse/CALCITE-968
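
For readers following along, a stream-to-relation join of the kind exercised in CALCITE-968 can be sketched as follows (the Orders stream and Products table are illustrative names):

```sql
-- Join a stream (Orders) against an ordinary relation (Products):
-- each streaming order row is enriched with the matching product name.
SELECT STREAM o.rowtime, o.orderId, p.name, o.units
FROM Orders AS o
JOIN Products AS p ON o.productId = p.id;
```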



-- 
Milinda Pathirage

PhD Student | Research Assistant
School of Informatics and Computing | Data to Insight Center
Indiana University

twitter: milindalakmal
skype: milinda.pathirage
blog: http://milinda.pathirage.org

Re: About Stream SQL

Posted by Fabian Hueske <fh...@gmail.com>.
Hi,

I agree, the Streaming page is a very good starting point for this
discussion. As suggested by Julian, I created CALCITE-1090 to update the
page such that it reflects the current state of the discussion (adding HOP
and TUMBLE functions, punctuations). I can also help with that, e.g., by
contributing figures, examples, or text, reviewing, or any other way.

From my point of view, the semantics of the window types and the other
operators in the Streaming document are very good. What is missing are
joins (windowed stream-stream joins, stream-table joins, stream-table joins
with table updates) as already noted in the todo section.

Regarding the handling of late-arriving events, I am not sure if this is a
purely QoS issue as the result of a query might depend on the chosen
strategy. Also users might want to pick different strategies for different
operations, so late-arriver strategies are not necessarily end-to-end but
can be operator specific. However, I think these details should be
discussed in a separate thread.

I'd like to add a few words about the StreamSQL roadmap of the Flink
community.
We are currently preparing our codebase and will start to work on support
for structured queries on streams in the next weeks. Flink will support two
query interfaces: a SQL interface and a LINQ-style Table API [1]. Both will
be optimized and translated by Calcite. As a first step, we want to add
simple stream transformations such as selection and projection to both
interfaces. Next, we will add windowing support (starting with tumbling and
hopping windows) to the Table API (as said before, our plans here are
well aligned with Julian's suggestions). Once this is done, we would extend
the SQL interface to support windows, which is hopefully as simple as using
a parser that accepts window syntax.
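
As a sketch of the window syntax referred to above (names are illustrative; in the proposal HOP takes the rowtime column, a slide interval and a window size):

```sql
-- One-hour aggregates emitted every 30 minutes; because size > slide,
-- each input row contributes to two overlapping windows.
SELECT STREAM
  HOP_END(rowtime, INTERVAL '30' MINUTE, INTERVAL '1' HOUR) AS rowtime,
  productId,
  COUNT(*) AS c
FROM Orders
GROUP BY HOP(rowtime, INTERVAL '30' MINUTE, INTERVAL '1' HOUR), productId;
```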

So from our point of view, fixing the semantics of windows and extending
the optimizer accordingly is more urgent than agreeing on a syntax
(although the Table API syntax could be inspired by Calcite's StreamSQL
syntax [2]). I can also help implementing the missing features in Calcite.

Having a reference implementation with tests would be awesome and
definitely help.

Best, Fabian

[1]
https://ci.apache.org/projects/flink/flink-docs-master/apis/batch/libs/table.html
[2]
https://docs.google.com/document/d/19kSOAOINKCSWLBCKRq2WvNtmuaA9o3AyCh2ePqr3V5E


Re: About Stream SQL

Posted by Julian Hyde <jh...@apache.org>.
Fabian,

Apologies for the late reply.

I would rather that the specification for streaming SQL was not too prescriptive for how late events were handled. Approaches 1, 2 and 3 are all viable, and engines can differentiate themselves by the strength of their support for this. But for the SQL to be considered valid, I think the validator just needs to know that it can make progress.

There is a large area of functionality I’d call “quality of service” (QoS). This includes latency, reliability, at-least-once, at-most-once or in-order guarantees, as well as the late-row-handling this thread is concerned with. What the QoS metrics have in common is that they are end-to-end. To deliver a high QoS to the consumer, the producer needs to conform to a high QoS. The QoS is beyond the control of the SQL statement. (Although you can ask what a SQL statement is able to deliver, given the upstream QoS guarantees.) QoS is best managed by the whole system, and in my opinion this is the biggest reason to have a DSMS.

For this reason, I would be inclined to put QoS constraints on the stream definition, not on the query. For example, taking latency as the QoS metric of interest, you could flag the Orders stream as “at most 10 ms latency between the record’s timestamp and the wall-clock time of the server receiving the records, and any records arriving after that time are logged and discarded”, and that QoS constraint applies to both producers and consumers.
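
No DDL for this exists today; purely as a hypothetical sketch, such a stream-level QoS annotation might look like:

```sql
-- Hypothetical syntax: the latency bound and late-row policy are declared
-- once on the stream definition and bind every producer and consumer.
CREATE STREAM Orders (
  rowtime   TIMESTAMP,
  productId INTEGER,
  units     INTEGER
) OPTIONS (
  MAX_LATENCY     INTERVAL '10' MILLISECOND,  -- rowtime vs. arrival time
  LATE_ROW_POLICY 'LOG_AND_DISCARD'           -- late rows logged, then dropped
);
```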

Given a query Q ‘select stream * from Orders’, it is valid to ask “what is the expected latency of Q?” or tell the planner “produce an implementation of Q with a latency of no more than 15 ms, and if you cannot achieve that latency, fail”. You could even register Q in the system and tell the system to tighten up the latency of any upstream streams and the standing queries that populate them. But it’s not valid to say “execute Q with a latency of 15 ms”: the system may not be able to achieve it.

In summary: I would allow latency and late-row-handling and other QoS annotations in the query but it’s not the most natural or powerful place to put them.

Julian




Re: About Stream SQL

Posted by "Wanglan (Lan)" <la...@huawei.com>.
Great discussions!

It seems some kind of agreement has been reached. In my opinion, the window definitions are the basic concepts we should clearly describe first. How is the progress? Do we need to create a JIRA or something?

Btw, happy Chinese new year ;) !

Lan
-----Original Message-----
From: Fabian Hueske [mailto:fhueske@gmail.com]
Sent: February 6, 2016 17:29
To: dev@calcite.apache.org
Subject: Re: About Stream SQL

Excellent! I missed the punctuations in the todo list.

What kind of strategies do you have in mind to handle events that arrive too late? I see
1. dropping of late events
2. computing an updated window result for each late arriving element (implies that the window state is stored for a certain period before it is discarded)
3. computing a delta to the previous window result for each late arriving element (requires window state as well, not applicable to all aggregation types)

It would be nice if strategies to handle late-arrivers could be defined in the query.

I think the plans of the Flink community are quite well aligned with your ideas for SQL on Streams.
Should we start by updating / extending the Stream document on the Calcite website to include the new window definitions (TUMBLE, HOP) and a discussion of punctuations/watermarks/time bounds?

Fabian






2016-02-06 2:35 GMT+01:00 Julian Hyde <jh...@apache.org>:

> Let me rephrase: The *majority* of the literature, of which I cited 
> just one example, calls them punctuation, and a couple of recent 
> papers out of Mountain View doesn't change that.
>
> There are some fine distinctions between punctuation, heartbeats, 
> watermarks and rowtime bounds, mostly in terms of how they are 
> generated and propagated, that matter little when planning the query.
>
> On Fri, Feb 5, 2016 at 5:18 PM, Ted Dunning <te...@gmail.com> wrote:
> > On Fri, Feb 5, 2016 at 5:10 PM, Julian Hyde <jh...@apache.org> wrote:
> >
> >> Yes, watermarks, absolutely. The "to do" list has "punctuation", 
> >> which is the same thing. (Actually, I prefer to call it "rowtime bound"
> >> because it feels more like a dynamic constraint than a piece of 
> >> data, but the literature[1] calls them punctuation.)
> >>
> >
> > Some of the literature calls them punctuation, other literature [1] 
> > calls them watermarks.
> >
> > [1] http://www.vldb.org/pvldb/vol8/p1792-Akidau.pdf
>

Re: About Stream SQL

Posted by Julian Hyde <jh...@apache.org>.
> On Feb 13, 2016, at 11:38 PM, Wanglan (Lan) <la...@huawei.com> wrote:
> 
> Great discussions!
> 
> It seems kind of agreement has been reached . In my opinion, the window definitions are the basic concepts we should clearly describe first. How is the progress? Do we need to create a jira or something? 

I think we have a good start in http://calcite.apache.org/docs/stream.html, and some extra ideas in my HOP/TUMBLE email[1]. I promised to write them up, but I haven’t got around to it yet, and a JIRA would help me remember.

I don’t know whether we can ever say we are “done” with a specification. Of course I can write down my opinion, but it would just be my opinion. :) If we have a conversation thread (or a JIRA case) about each requirement, and representatives of the streaming projects (Fabian for Flink, Milinda for Samza, ? for Storm) chime in, maybe we can reach consensus.

After we reach consensus on a particular feature, and write it up, I’d also like to create a set of sample queries & responses that illustrate that feature. Calcite could contain a TCK that any compliant SQL engine could run. How do people feel about having tests as a deliverable? Would you use them in your project?

> Btw, happy Chinese new year ;) !

Thank you! And you too!

Julian

[1] http://mail-archives.apache.org/mod_mbox/calcite-dev/201506.mbox/%3CCAPSgeETbowxM2TRX0RFxQ_tEAPk2uM=hE0aryWinBtovGwbddQ@mail.gmail.com%3E

> 
> Lan
> 
> -----Original Message-----
> From: Fabian Hueske [mailto:fhueske@gmail.com]
> Sent: February 6, 2016 17:29
> To: dev@calcite.apache.org
> Subject: Re: About Stream SQL
> 
> Excellent! I missed the punctuations in the todo list.
> 
> What kind of strategies do you have in mind to handle events that arrive too late? I see
> 1. dropping of late events
> 2. computing an updated window result for each late arriving element (implies that the window state is stored for a certain period before it is discarded)
> 3. computing a delta to the previous window result for each late arriving element (requires window state as well, not applicable to all aggregation types)
> 
> It would be nice if strategies to handle late-arrivers could be defined in the query.
> 
> I think the plans of the Flink community are quite well aligned with your ideas for SQL on Streams.
> Should we start by updating / extending the Stream document on the Calcite website to include the new window definitions (TUMBLE, HOP) and a discussion of punctuations/watermarks/time bounds?
> 
> Fabian
> 
> 
> 
> 
> 
> 
> 2016-02-06 2:35 GMT+01:00 Julian Hyde <jh...@apache.org>:
> 
>> Let me rephrase: The *majority* of the literature, of which I cited 
>> just one example, calls them punctuation, and a couple of recent 
>> papers out of Mountain View doesn't change that.
>> 
>> There are some fine distinctions between punctuation, heartbeats, 
>> watermarks and rowtime bounds, mostly in terms of how they are 
>> generated and propagated, that matter little when planning the query.
>> 
>> On Fri, Feb 5, 2016 at 5:18 PM, Ted Dunning <te...@gmail.com> wrote:
>>> On Fri, Feb 5, 2016 at 5:10 PM, Julian Hyde <jh...@apache.org> wrote:
>>> 
>>>> Yes, watermarks, absolutely. The "to do" list has "punctuation", 
>>>> which is the same thing. (Actually, I prefer to call it "rowtime bound"
> >> because it feels more like a dynamic constraint than a piece of 
>>>> data, but the literature[1] calls them punctuation.)
>>>> 
>>> 
>>> Some of the literature calls them punctuation, other literature [1] 
>>> calls them watermarks.
>>> 
>>> [1] http://www.vldb.org/pvldb/vol8/p1792-Akidau.pdf
>> 


Re: About Stream SQL

Posted by "Wanglan (Lan)" <la...@huawei.com>.
Great discussions!

It seems some kind of agreement has been reached. In my opinion, the window definitions are the basic concepts we should clearly describe first. How is it progressing? Do we need to create a JIRA or something?

Btw, happy Chinese new year ;) !

Lan

-----Original Message-----
From: Fabian Hueske [mailto:fhueske@gmail.com]
Sent: February 6, 2016 17:29
To: dev@calcite.apache.org
Subject: Re: About Stream SQL

Excellent! I missed the punctuations in the todo list.

What kind of strategies do you have in mind to handle events that arrive too late? I see
1. dropping of late events
2. computing an updated window result for each late arriving element (implies that the window state is stored for a certain period before it is discarded)
3. computing a delta to the previous window result for each late arriving element (requires window state as well, not applicable to all aggregation types)

It would be nice if strategies to handle late-arrivers could be defined in the query.

I think the plans of the Flink community are quite well aligned with your ideas for SQL on Streams.
Should we start by updating / extending the Stream document on the Calcite website to include the new window definitions (TUMBLE, HOP) and a discussion of punctuations/watermarks/time bounds?

Fabian






2016-02-06 2:35 GMT+01:00 Julian Hyde <jh...@apache.org>:

> Let me rephrase: The *majority* of the literature, of which I cited 
> just one example, calls them punctuation, and a couple of recent 
> papers out of Mountain View doesn't change that.
>
> There are some fine distinctions between punctuation, heartbeats, 
> watermarks and rowtime bounds, mostly in terms of how they are 
> generated and propagated, that matter little when planning the query.
>
> On Fri, Feb 5, 2016 at 5:18 PM, Ted Dunning <te...@gmail.com> wrote:
> > On Fri, Feb 5, 2016 at 5:10 PM, Julian Hyde <jh...@apache.org> wrote:
> >
> >> Yes, watermarks, absolutely. The "to do" list has "punctuation", 
> >> which is the same thing. (Actually, I prefer to call it "rowtime bound"
> >> because it feels more like a dynamic constraint than a piece of 
> >> data, but the literature[1] calls them punctuation.)
> >>
> >
> > Some of the literature calls them punctuation, other literature [1] 
> > calls them watermarks.
> >
> > [1] http://www.vldb.org/pvldb/vol8/p1792-Akidau.pdf
>

Re: About Stream SQL

Posted by Fabian Hueske <fh...@gmail.com>.
Excellent! I missed the punctuations in the todo list.

What kind of strategies do you have in mind to handle events that arrive
too late? I see
1. dropping of late events
2. computing an updated window result for each late arriving
element (implies that the window state is stored for a certain period
before it is discarded)
3. computing a delta to the previous window result for each late arriving
element (requires window state as well, not applicable to all aggregation
types)
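For illustration, strategies 1 and 2 can be sketched for a simple tumbling count aggregate (a hypothetical sketch; all names are invented, and a real engine would keep this state partitioned and distributed):

```python
TUMBLE_SIZE = 10  # window length in event-time units (illustrative)

class TumblingCounter:
    """Counts rows per tumbling window; drops or updates late events."""

    def __init__(self, allowed_lateness):
        self.allowed_lateness = allowed_lateness
        self.state = {}                  # window start -> row count
        self.watermark = float("-inf")
        self.emitted = []                # (window start, count), incl. updates
        self.dropped = 0

    def on_event(self, ts):
        start = (ts // TUMBLE_SIZE) * TUMBLE_SIZE
        end = start + TUMBLE_SIZE
        if end + self.allowed_lateness <= self.watermark:
            self.dropped += 1            # strategy 1: drop too-late events
            return
        self.state[start] = self.state.get(start, 0) + 1
        if end <= self.watermark:
            # strategy 2: the window already fired; emit an updated result
            self.emitted.append((start, self.state[start]))

    def on_watermark(self, wm):
        old, self.watermark = self.watermark, wm
        for start in sorted(self.state):
            if old < start + TUMBLE_SIZE <= wm:   # window is now complete
                self.emitted.append((start, self.state[start]))
        # keep state only while a window may still receive late updates
        self.state = {s: c for s, c in self.state.items()
                      if s + TUMBLE_SIZE + self.allowed_lateness > wm}
```

For example, with allowed_lateness=5, events at times 1, 3, 12 followed by a watermark of 10 fire (0, 2) for window [0, 10); a late event at time 4 re-emits (0, 3); after a watermark of 20 that window's state is discarded, and a further event at time 2 is dropped.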

It would be nice if strategies to handle late-arrivers could be defined in
the query.

I think the plans of the Flink community are quite well aligned with your
ideas for SQL on Streams.
Should we start by updating / extending the Stream document on the Calcite
website to include the new window definitions (TUMBLE, HOP) and a
discussion of punctuations/watermarks/time bounds?

Fabian






2016-02-06 2:35 GMT+01:00 Julian Hyde <jh...@apache.org>:

> Let me rephrase: The *majority* of the literature, of which I cited
> just one example, calls them punctuation, and a couple of recent
> papers out of Mountain View doesn't change that.
>
> There are some fine distinctions between punctuation, heartbeats,
> watermarks and rowtime bounds, mostly in terms of how they are
> generated and propagated, that matter little when planning the query.
>
> On Fri, Feb 5, 2016 at 5:18 PM, Ted Dunning <te...@gmail.com> wrote:
> > On Fri, Feb 5, 2016 at 5:10 PM, Julian Hyde <jh...@apache.org> wrote:
> >
> >> Yes, watermarks, absolutely. The "to do" list has "punctuation", which
> >> is the same thing. (Actually, I prefer to call it "rowtime bound"
> >> because it feels more like a dynamic constraint than a piece of
> >> data, but the literature[1] calls them punctuation.)
> >>
> >
> > Some of the literature calls them punctuation, other literature [1] calls
> > them watermarks.
> >
> > [1] http://www.vldb.org/pvldb/vol8/p1792-Akidau.pdf
>

Re: About Stream SQL

Posted by Julian Hyde <jh...@apache.org>.
Let me rephrase: The *majority* of the literature, of which I cited
just one example, calls them punctuation, and a couple of recent
papers out of Mountain View doesn't change that.

There are some fine distinctions between punctuation, heartbeats,
watermarks and rowtime bounds, mostly in terms of how they are
generated and propagated, that matter little when planning the query.

On Fri, Feb 5, 2016 at 5:18 PM, Ted Dunning <te...@gmail.com> wrote:
> On Fri, Feb 5, 2016 at 5:10 PM, Julian Hyde <jh...@apache.org> wrote:
>
>> Yes, watermarks, absolutely. The "to do" list has "punctuation", which
>> is the same thing. (Actually, I prefer to call it "rowtime bound"
>> because it feels more like a dynamic constraint than a piece of
>> data, but the literature[1] calls them punctuation.)
>>
>
> Some of the literature calls them punctuation, other literature [1] calls
> them watermarks.
>
> [1] http://www.vldb.org/pvldb/vol8/p1792-Akidau.pdf

Re: About Stream SQL

Posted by Ted Dunning <te...@gmail.com>.
On Fri, Feb 5, 2016 at 5:10 PM, Julian Hyde <jh...@apache.org> wrote:

> Yes, watermarks, absolutely. The "to do" list has "punctuation", which
> is the same thing. (Actually, I prefer to call it "rowtime bound"
> because it feels more like a dynamic constraint than a piece of
> data, but the literature[1] calls them punctuation.)
>

Some of the literature calls them punctuation, other literature [1] calls
them watermarks.

[1] http://www.vldb.org/pvldb/vol8/p1792-Akidau.pdf

Re: About Stream SQL

Posted by Julian Hyde <jh...@apache.org>.
Yes, watermarks, absolutely. The "to do" list has "punctuation", which
is the same thing. (Actually, I prefer to call it "rowtime bound"
because it feels more like a dynamic constraint than a piece of
data, but the literature[1] calls them punctuation.)

If a stream has punctuation enabled then it may not be sorted but is
nevertheless sortable. So I'm doing what theoreticians often do --
transposing the problem into a simpler domain.

By the way, an out-of-order stream is also sortable if it is t-sorted
(i.e. every record is guaranteed to arrive within t seconds of its
timestamp) or k-sorted (i.e. every record is guaranteed to be no more
than k positions out of order). So queries on these streams can be
planned similarly to queries on streams with punctuation.
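For instance, a k-sorted stream can be restored to timestamp order with a bounded buffer of k+1 records (a sketch of the idea, not Calcite code):

```python
import heapq

def sort_k_sorted(stream, k):
    """Yield records of a k-sorted stream in timestamp order.

    Because no record is more than k positions out of order, a min-heap
    holding at most k+1 records is enough to emit the stream fully
    sorted; the price is a latency of up to k records.
    """
    heap = []
    for record in stream:
        heapq.heappush(heap, record)
        if len(heap) > k:
            yield heapq.heappop(heap)
    while heap:                 # flush the buffer at end of input
        yield heapq.heappop(heap)
```

A t-sorted stream works the same way, except the buffer is flushed by time (or by punctuation) rather than by record count.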

And, we often want to aggregate over attributes that are not
time-based but are nevertheless monotonic. "The number of times a team
has shifted between winning-state and losing-state" is one such
monotonic attribute. The system needs to figure out for itself that it
is safe to aggregate over such an attribute; punctuation does not add
any extra information.

I have in mind some metadata (cost metrics) for the planner:

1. Is this stream sorted on a given attribute (or attributes)? (RelMdCollation)
2. Is it possible to sort the stream on a given attribute? (For finite
relations, the answer is always "yes"; for streams it depends on the
existence of punctuation, or linkage between the attributes and the
sort key.)
3. What latency do we need to introduce in order to perform that sort?
4. What is the cost (in CPU, memory etc.) of performing that sort?

We already have (1), in BuiltInMetadata.Collation [2]. For (2), the
answer is always "true" for finite relations. But we'll need to
implement (2), (3) and (4) for streams.
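To sketch what (2)-(4) might look like for streams (hypothetical names; the real metadata interfaces are the Java classes in BuiltInMetadata):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class SortCost:
    feasible: bool                 # (2) can the input be sorted on the key?
    latency: Optional[float]       # (3) delay introduced by the sort
    memory: Optional[float]        # (4) rough buffering cost

def sortability(finite, punctuated=False, t_bound=None, k_bound=None):
    """Hypothetical metadata handler answering questions (2)-(4)."""
    if finite:
        return SortCost(True, 0.0, 0.0)   # finite relations: always sortable
    if punctuated:
        # sortable; latency depends on how often punctuation arrives
        return SortCost(True, None, None)
    if t_bound is not None:               # t-sorted: buffer t seconds
        return SortCost(True, t_bound, t_bound)
    if k_bound is not None:               # k-sorted: buffer k records
        return SortCost(True, float(k_bound), float(k_bound))
    return SortCost(False, None, None)    # unbounded, no bound: cannot sort
```

The planner would consult such a handler before deciding whether a sort-based plan is even legal for a given stream.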

Julian

[1] http://www.whitworth.edu/academic/department/mathcomputerscience/faculty/tuckerpeter/pdf/117896_final.pdf

[2] https://calcite.apache.org/apidocs/org/apache/calcite/rel/metadata/BuiltInMetadata.Collation.html

On Fri, Feb 5, 2016 at 3:46 PM, Fabian Hueske <fh...@gmail.com> wrote:
> Hi,
>
> first of all, thanks for starting this discussion. As Stephan said before,
> the Flink community is working towards support for SQL on streams. IMO, it
> would be very nice if the different efforts for SQL on streams could
> converge towards a core set of semantics and syntax.
>
> I read the proposal on the Calcite website [1] and the mail thread about
> the TUMBLE and HOP functions [2]. IMO, the discussed windowing semantics
> are well defined and could serve as a common basis. The improved syntax for
> tumbling and hopping windows is very nice (would be good to update the
> Calcite website with this). More concise definitions for sliding and
> cascading windows would be great, too.
>
> One concern that I have is the requirement of a monotonic attribute. Such
> an attribute might be present in use cases where events are generated at a
> central source, such as transactions of a database system. However, events
> that originate from distributed sources such as sensors, log events of
> cluster machines, etc, do not arrive in timestamp order at a stream
> processor. In fact, the majority of use cases from Flink users have to deal
> with out-of-order events. Flink (and Google Cloud Dataflow / Apache Beam
> incubating) use the notion of event time and watermarks [3][4] to handle
> streams with out-of-order events. While watermarks can help a lot to
> improve the consistency of results, it is not possible to guarantee that no
> late events arrive at an operator. There are different strategies to deal
> with late arriving events (dropping, recomputation of aggregates, ...), but
> usually they have a negative effect on the result correctness and/or a
> downstream system needs to deal with them, e.g., updating previous results.
>
> Relaxing the requirement for monotonic attributes to attributes with
> watermarks (watermarks are provided by the data source), would have no
> impact on the correctness of queries over tables with an ordered attribute
> (since there would be no late arriving events). It would also not change
> the semantics of queries over streams with a monotonic attribute, but would
> also allow for streams with (slightly) out-of-order arriving events at the
> cost that query results may become inconsistent in case of late arriving
> events.
>
> I agree that defining use cases and queries over some sample data would be
> a good way to start.
>
> Best, Fabian
>
> [1] http://calcite.apache.org/docs/stream.html
> [2]
> http://mail-archives.apache.org/mod_mbox/calcite-dev/201506.mbox/%3CCAPSgeETbowxM2TRX0RFxQ_tEAPk2uM=hE0aryWinBtovGwbddQ@mail.gmail.com%3E
> [3]
> http://data-artisans.com/how-apache-flink-enables-new-streaming-applications-part-1/
> [4]
> https://ci.apache.org/projects/flink/flink-docs-master/apis/streaming/index.html#working-with-time
>
> 2016-02-05 11:29 GMT+01:00 Julian Hyde <jh...@apache.org>:
>
>> Stephan,
>>
>> I agree that we have a long way to go to get to a standard, but I
>> think that means that we should start as soon as possible. By which I
>> mean, rather than going ahead and creating Flink-specific extensions,
>> let's have discussions about SQL extensions in a broad forum, pulling
>> in members of other projects. Streaming/CEP is not a new field, so the
>> use cases are well known.
>>
>> It's true that Calcite's streaming SQL doesn't go far beyond standard
>> SQL. I don't want to diverge too far; and besides, one can accomplish
>> a lot with each feature added to the language. What is needed are a
>> very few well chosen deep features; then we can add liberal syntactic
>> sugar on top of these to make common use cases more concise.
>>
>> For the record, HOP and TUMBLE described in [1] are not OLAP sliding
>> windows; they go in the GROUP BY clause. But unlike GROUP BY, they
>> allow each row to contribute to more than one sub-total. This is novel
>> in SQL (only GROUPING SETS allows this, and in a limited form) and
>> could be the basis for user-defined windows.
>>
>> Also, our sliding windows can be defined by row count as well as by
>> time. For example, suppose you want to calculate the length of a
>> sports team's streak of consecutive wins or losses. You can partition
>> by N, where N is the number of state changes of the win-loss variable,
>> and so each switch from win-to-lose or lose-to-win starts a new
>> window.
>>
>> As you define your extensions to SQL, I strongly suggest that you make
>> explicit, as columns, any data values that drive behavior. These might
>> include event times, arrival times, processing times, flush
>> directives, and any required orderings. If information is not
>> explicit, we cannot reason about the query as algebra, and the
>> planner cannot consider more radical plans, such as sorting (within a window)
>> or changing how the rows are partitioned across parallel processors. A
>> litmus test is whether a database can apply the same SQL to the data
>> archived from the stream and achieve the same results.
>>
>> I saw that Fabian blogged recently[2] about stream windows in Flink.
>> Could we start the process by trying to convert some use cases
>> (expressed in English, and with sample input and output data) into
>> SQL? Then we can iterate to make the SQL concise, understandable, and
>> well-defined.
>>
>> Julian
>>
>> [1]
>> http://mail-archives.apache.org/mod_mbox/calcite-dev/201506.mbox/%3CCAPSgeETbowxM2TRX0RFxQ_tEAPk2uM=hE0aryWinBtovGwbddQ@mail.gmail.com%3E
>>
>> [2] https://flink.apache.org/news/2015/12/04/Introducing-windows.html
>>
>> On Thu, Feb 4, 2016 at 1:35 AM, Stephan Ewen <se...@apache.org> wrote:
>> > Hi!
>> >
>> > True, the Flink community is looking into stream SQL, and is currently
>> > building on top of Calcite. This is all going well, but we probably need
>> > some custom syntax around windowing.
>> >
>> > For Stream SQL Windowing, what I have seen so far in Calcite (correct me
>> if
>> > I am wrong there), is pretty much a variant of the OLAP sliding window
>> > aggregates.
>> >
>> >   - Windows in those are basically calculated by rounding timestamps
>> > down/up, thus bucketizing the events. That works for many cases, but is
>> > quite tricky syntax.
>> >
>> >   - Flink supports various notions of time for windowing (processing
>> time,
>> > ingestion time, event time), as well as triggers. To be able to extend
>> the
>> > window specification with such additional parameters is pretty crucial
>> and
>> > would probably go well with a dedicated window clause.
>> >
>> >   - Flink also has unaligned windows (sessions, timeouts, ...) which are
>> > very hard to map to grouping and window aggregations across ordered
>> groups.
>> >
>> >
>> > Converging to a core standard around stream SQL is very desirable, I
>> > completely agree.
>> > For the basic constructs, I think this is quite feasible and Calcite has
>> > some good suggestions there.
>> >
>> > In the advanced constructs, the systems differ quite heavily currently, so
>> > converging may be harder there. Also, we are just learning what
>> > semantics people need concerning windowing/event time/etc. It may almost be
>> > a tad too early to try and define a standard there...
>> >
>> >
>> > Greetings,
>> > Stephan
>> >
>> >
>> > On Thu, Feb 4, 2016 at 9:35 AM, Julian Hyde <jh...@apache.org> wrote:
>> >
>> >> I totally agree with you. (Sorry for the delayed response; this week has
>> >> been very busy.)
>> >>
>> >> There is a tendency of vendors (and projects) to think that their
>> >> technology is unique, and superior to everyone else’s, and want to
>> showcase
>> >> it in their dialect of SQL. That is natural, and it’s OK, since it makes
>> >> them strive to make their technology better.
>> >>
>> >> However, they have to remember that the end users don’t want something
>> >> unique, they want something that solves their problem. They would like
>> >> something that is standards compliant so that it is easy to learn, easy
>> to
>> >> hire developers for, and — if the worst comes to the worst — easy to
>> >> migrate to a compatible competing technology.
>> >>
>> >> I know the developers at Storm and Flink (and Samza too) and they
>> >> understand the importance of collaborating on a standard.
>> >>
>> >> I have been trying to play a dual role: supplying the parser and planner
>> >> for streaming SQL, and also to facilitate the creation of a standard
>> >> language and semantics of streaming SQL. For the latter, see Streaming
>> page
>> >> on Calcite’s web site[1]. On that page, I intend to illustrate all of
>> the
>> >> main patterns of streaming queries, give them names (e.g. “Tumbling
>> >> windows”), and show how those translate into streaming SQL.
>> >>
>> >> Also, it would be useful to create a reference implementation of
>> streaming
>> >> SQL in Calcite so that you can validate and run queries. The
>> performance,
>> >> scalability and reliability will not be the same as if you ran Storm,
>> Flink
>> >> or Samza, but at least you can see what the semantics should be.
>> >>
>> >> I believe that most, if not all, of the examples that the projects are
>> >> coming up with can be translated into SQL. It will be challenging,
>> because
>> >> we want to preserve the semantics of SQL, allow streaming SQL to
>> >> interoperate with traditional relations, and also retain the general
>> look
>> >> and feel of SQL. (For example, I fought quite hard[2] recently for the
>> >> principle that GROUP BY defines a partition (in the set-theory sense)[3]
>> >> and therefore could not be used to represent a tumbling window, until I
>> >> remembered that GROUPING SETS already allows each input row to appear in
>> >> more than one output sub-total.)
>> >>
>> >> What can you, the users, do? Get involved in the discussion about what
>> you
>> >> want in the language. Encourage the projects to bring their proposed SQL
>> >> features into this forum for discussion, and add to the list of patterns
>> >> and examples on the Streaming page. As in any standards process, the
>> users
>> >> help to keep the vendors focused.
>> >>
>> >> I’ll be talking about streaming SQL, planning, and standardization at
>> the
>> >> Samza meetup in 2 weeks[4], so if any of you are in the Bay Area, please
>> >> stop by.
>> >>
>> >> Julian
>> >>
>> >> [1] http://calcite.apache.org/docs/stream.html
>> >>
>> >> [2]
>> >>
>> http://mail-archives.apache.org/mod_mbox/calcite-dev/201506.mbox/%3CCAPSgeETbowxM2TRX0RFxQ_tEAPk2uM=hE0aryWinBtovGwbddQ@mail.gmail.com%3E
>> >>
>> >> [3] https://en.wikipedia.org/wiki/Partition_of_a_set
>> >>
>> >> [4] http://www.meetup.com/Bay-Area-Samza-Meetup/events/228430492/
>> >>
>> >> > On Jan 29, 2016, at 10:29 PM, Wanglan (Lan) <la...@huawei.com>
>> >> wrote:
>> >> >
>> >> > Hi to all,
>> >> >
>> >> > I am from Huawei and am focusing on data stream processing.
>> >> > Recently I noticed that in both the Storm community and the Flink community
>> >> > there are endeavors to use Calcite as a SQL parser to enable Storm/Flink to
>> >> > support SQL. They both want to supplement or clarify the streaming SQL of
>> >> > Calcite, especially the definition of windows.
>> >> > I am concerned that if both communities work on designing Stream SQL
>> >> > syntax separately, two different syntaxes will emerge that
>> >> > represent the same use case.
>> >> > Therefore, I am wondering if it is possible to unify this work, i.e.
>> >> > design and complement the Calcite Streaming SQL to enrich its window
>> >> > definitions so that both Storm and Flink can reuse Calcite (Streaming SQL)
>> >> > as their SQL parser for streaming cases with little change.
>> >> > What do you think about this idea?
>> >> >
>> >>
>> >>
>>

Re: About Stream SQL

Posted by Fabian Hueske <fh...@gmail.com>.
Hi,

first of all, thanks for starting this discussion. As Stephan said before,
the Flink community is working towards support for SQL on streams. IMO, it
would be very nice if the different efforts for SQL on streams could
converge towards a core set of semantics and syntax.

I read the proposal on the Calcite website [1] and the mail thread about
the TUMBLE and HOP functions [2]. IMO, the discussed windowing semantics
are well defined and could serve as a common basis. The improved syntax for
tumbling and hopping windows is very nice (would be good to update the
Calcite website with this). More concise definitions for sliding and
cascading windows would be great, too.
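The key semantic difference is easy to state: a tumbling window assigns each row to exactly one window, while a hopping window assigns it to size/slide overlapping windows. A sketch (illustrative, not Calcite's implementation):

```python
def tumble(ts, size):
    """Window assignment for TUMBLE: each row falls in exactly one window."""
    start = (ts // size) * size
    return [(start, start + size)]

def hop(ts, slide, size):
    """Window assignment for HOP: each row falls in size/slide windows."""
    windows = []
    start = (ts // slide) * slide        # latest window containing ts
    while start > ts - size:
        windows.append((start, start + size))
        start -= slide
    return sorted(windows)
```

For example, a row at time 7 lands in the single tumbling window [0, 10), but in both hopping windows [0, 10) and [5, 15) for slide 5 and size 10 — the property that makes HOP novel relative to plain GROUP BY.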

One concern that I have is the requirement of a monotonic attribute. Such
an attribute might be present in use cases where events are generated at a
central source, such as transactions of a database system. However, events
that originate from distributed sources such as sensors, log events of
cluster machines, etc, do not arrive in timestamp order at a stream
processor. In fact, the majority of use cases from Flink users have to deal
with out-of-order events. Flink (and Google Cloud Dataflow / Apache Beam
incubating) use the notion of event time and watermarks [3][4] to handle
streams with out-of-order events. While watermarks can help a lot to
improve the consistency of results, it is not possible to guarantee that no
late events arrive at an operator. There are different strategies to deal
with late arriving events (dropping, recomputation of aggregates, ...), but
usually they have a negative effect on the result correctness and/or a
downstream system needs to deal with them, e.g., updating previous results.

Relaxing the requirement for monotonic attributes to attributes with
watermarks (watermarks are provided by the data source) would have no
impact on the correctness of queries over tables with an ordered attribute
(since there would be no late arriving events). It would also not change
the semantics of queries over streams with a monotonic attribute, but would
allow for streams with (slightly) out-of-order arriving events, at the
cost that query results may become inconsistent in case of late arriving
events.
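A watermark provided by a source is usually just a heuristic; a common one is bounded out-of-orderness (a sketch with invented names, not the Flink or Beam API):

```python
class BoundedOutOfOrderness:
    """Watermark = highest event time seen minus an allowed delay.

    The watermark asserts that no earlier-timestamped event is still
    expected; events that arrive below it are "late" and must be
    dropped or folded into an updated result.
    """

    def __init__(self, max_delay):
        self.max_delay = max_delay
        self.max_ts = float("-inf")

    def on_event(self, ts):
        self.max_ts = max(self.max_ts, ts)

    def current_watermark(self):
        return self.max_ts - self.max_delay
```

With max_delay=3 and events at times 5, 2, 9, 7 the watermark is 6, so the out-of-order events at 2 and 7 arrived safely; an event with timestamp 4 arriving afterwards would be late.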

I agree that defining use cases and queries over some sample data would be
a good way to start.

Best, Fabian

[1] http://calcite.apache.org/docs/stream.html
[2]
http://mail-archives.apache.org/mod_mbox/calcite-dev/201506.mbox/%3CCAPSgeETbowxM2TRX0RFxQ_tEAPk2uM=hE0aryWinBtovGwbddQ@mail.gmail.com%3E
[3]
http://data-artisans.com/how-apache-flink-enables-new-streaming-applications-part-1/
[4]
https://ci.apache.org/projects/flink/flink-docs-master/apis/streaming/index.html#working-with-time

2016-02-05 11:29 GMT+01:00 Julian Hyde <jh...@apache.org>:

> Stephan,
>
> I agree that we have a long way to go to get to a standard, but I
> think that means that we should start as soon as possible. By which I
> mean, rather than going ahead and creating Flink-specific extensions,
> let's have discussions about SQL extensions in a broad forum, pulling
> in members of other projects. Streaming/CEP is not a new field, so the
> use cases are well known.
>
> It's true that Calcite's streaming SQL doesn't go far beyond standard
> SQL. I don't want to diverge too far; and besides, one can accomplish
> a lot with each feature added to the language. What is needed are a
> very few well chosen deep features; then we can add liberal syntactic
> sugar on top of these to make common use cases more concise.
>
> For the record, HOP and TUMBLE described in [1] are not OLAP sliding
> windows; they go in the GROUP BY clause. But unlike GROUP BY, they
> allow each row to contribute to more than one sub-total. This is novel
> in SQL (only GROUPING SETS allows this, and in a limited form) and
> could be the basis for user-defined windows.
>
> Also, our sliding windows can be defined by row count as well as by
> time. For example, suppose you want to calculate the length of a
> sports team's streak of consecutive wins or losses. You can partition
> by N, where N is the number of state changes of the win-loss variable,
> and so each switch from win-to-lose or lose-to-win starts a new
> window.
>
> As you define your extensions to SQL, I strongly suggest that you make
> explicit, as columns, any data values that drive behavior. These might
> include event times, arrival times, processing times, flush
> directives, and any required orderings. If information is not
> explicit, we cannot reason about the query as algebra, and the
> planner cannot consider more radical plans, such as sorting (within a window)
> or changing how the rows are partitioned across parallel processors. A
> litmus test is whether a database can apply the same SQL to the data
> archived from the stream and achieve the same results.
>
> I saw that Fabian blogged recently[2] about stream windows in Flink.
> Could we start the process by trying to convert some use cases
> (expressed in English, and with sample input and output data) into
> SQL? Then we can iterate to make the SQL concise, understandable, and
> well-defined.
>
> Julian
>
> [1]
> http://mail-archives.apache.org/mod_mbox/calcite-dev/201506.mbox/%3CCAPSgeETbowxM2TRX0RFxQ_tEAPk2uM=hE0aryWinBtovGwbddQ@mail.gmail.com%3E
>
> [2] https://flink.apache.org/news/2015/12/04/Introducing-windows.html
>
> On Thu, Feb 4, 2016 at 1:35 AM, Stephan Ewen <se...@apache.org> wrote:
> > Hi!
> >
> > True, the Flink community is looking into stream SQL, and is currently
> > building on top of Calcite. This is all going well, but we probably need
> > some custom syntax around windowing.
> >
> > For Stream SQL Windowing, what I have seen so far in Calcite (correct me
> if
> > I am wrong there), is pretty much a variant of the OLAP sliding window
> > aggregates.
> >
> >   - Windows in those are basically calculated by rounding timestamps
> > down/up, thus bucketizing the events. That works for many cases, but is
> > quite tricky syntax.
> >
> >   - Flink supports various notions of time for windowing (processing
> time,
> > ingestion time, event time), as well as triggers. To be able to extend
> the
> > window specification with such additional parameters is pretty crucial
> and
> > would probably go well with a dedicated window clause.
> >
> >   - Flink also has unaligned windows (sessions, timeouts, ...) which are
> > very hard to map to grouping and window aggregations across ordered
> groups.
> >
> >
> > Converging to a core standard around stream SQL is very desirable, I
> > completely agree.
> > For the basic constructs, I think this is quite feasible and Calcite has
> > some good suggestions there.
> >
> > On the advanced constructs, the systems currently differ quite
> > heavily, so converging may be harder. Also, we are just learning what
> > semantics people need concerning windowing/event time/etc. It may be
> > a bit too early to try to define a standard there...
> >
> >
> > Greetings,
> > Stephan
> >
> >
> > On Thu, Feb 4, 2016 at 9:35 AM, Julian Hyde <jh...@apache.org> wrote:
> >
> >> I totally agree with you. (Sorry for the delayed response; this week has
> >> been very busy.)
> >>
> >> There is a tendency of vendors (and projects) to think that their
> >> technology is unique, and superior to everyone else’s, and want to
> >> showcase it in their dialect of SQL. That is natural, and it’s OK,
> >> since it makes them strive to make their technology better.
> >>
> >> However, they have to remember that the end users don’t want something
> >> unique, they want something that solves their problem. They would like
> >> something that is standards compliant so that it is easy to learn,
> >> easy to hire developers for, and — if the worst comes to the worst —
> >> easy to migrate to a compatible competing technology.
> >>
> >> I know the developers at Storm and Flink (and Samza too) and they
> >> understand the importance of collaborating on a standard.
> >>
> >> I have been trying to play a dual role: supplying the parser and
> >> planner for streaming SQL, and also to facilitate the creation of a
> >> standard language and semantics of streaming SQL. For the latter, see
> >> Streaming page on Calcite’s web site[1]. On that page, I intend to
> >> illustrate all of the main patterns of streaming queries, give them
> >> names (e.g. “Tumbling windows”), and show how those translate into
> >> streaming SQL.
> >>
> >> Also, it would be useful to create a reference implementation of
> >> streaming SQL in Calcite so that you can validate and run queries. The
> >> performance, scalability and reliability will not be the same as if
> >> you ran Storm, Flink or Samza, but at least you can see what the
> >> semantics should be.
> >>
> >> I believe that most, if not all, of the examples that the projects are
> >> coming up with can be translated into SQL. It will be challenging,
> >> because we want to preserve the semantics of SQL, allow streaming SQL
> >> to interoperate with traditional relations, and also retain the
> >> general look and feel of SQL. (For example, I fought quite hard[2]
> >> recently for the principle that GROUP BY defines a partition (in the
> >> set-theory sense)[3] and therefore could not be used to represent a
> >> tumbling window, until I remembered that GROUPING SETS already allows
> >> each input row to appear in more than one output sub-total.)
> >>
> >> What can you, the users, do? Get involved in the discussion about what
> >> you want in the language. Encourage the projects to bring their
> >> proposed SQL features into this forum for discussion, and add to the
> >> list of patterns and examples on the Streaming page. As in any
> >> standards process, the users help to keep the vendors focused.
> >>
> >> I’ll be talking about streaming SQL, planning, and standardization at
> >> the Samza meetup in 2 weeks[4], so if any of you are in the Bay Area,
> >> please stop by.
> >>
> >> Julian
> >>
> >> [1] http://calcite.apache.org/docs/stream.html
> >>
> >> [2]
> >> http://mail-archives.apache.org/mod_mbox/calcite-dev/201506.mbox/%3CCAPSgeETbowxM2TRX0RFxQ_tEAPk2uM=hE0aryWinBtovGwbddQ@mail.gmail.com%3E
> >>
> >> [3] https://en.wikipedia.org/wiki/Partition_of_a_set
> >>
> >> [4] http://www.meetup.com/Bay-Area-Samza-Meetup/events/228430492/
> >>
> >> > On Jan 29, 2016, at 10:29 PM, Wanglan (Lan) <la...@huawei.com>
> >> wrote:
> >> >
> >> > Hi to all,
> >> >
> >> > I am from Huawei, focusing on data stream processing.
> >> > Recently I noticed that both the Storm and Flink communities are
> >> > endeavoring to use Calcite as a SQL parser so that Storm/Flink can
> >> > support SQL. Both want to supplement or clarify Calcite's streaming
> >> > SQL, especially the definition of windows.
> >> > My concern is that if the two communities design Stream SQL syntax
> >> > separately, two different syntaxes will emerge for the same use
> >> > case.
> >> > Therefore, I am wondering whether it is possible to unify this
> >> > work, i.e. design and complement Calcite's streaming SQL to enrich
> >> > the window definitions, so that both Storm and Flink can reuse
> >> > Calcite (streaming SQL) as their SQL parser for streaming cases
> >> > with little change.
> >> > What do you think about this idea?
> >> >
> >>
> >>
>

Re: About Stream SQL

Posted by Julian Hyde <jh...@apache.org>.
Stephan,

I agree that we have a long way to go to get to a standard, but I
think that means that we should start as soon as possible. By which I
mean, rather than going ahead and creating Flink-specific extensions,
let's have discussions about SQL extensions in a broad forum, pulling
in members of other projects. Streaming/CEP is not a new field, so the
use cases are well known.

It's true that Calcite's streaming SQL doesn't go far beyond standard
SQL. I don't want to diverge too far; and besides, one can accomplish
a lot with each feature added to the language. What is needed is a
very few well-chosen deep features; then we can add liberal syntactic
sugar on top of these to make common use cases more concise.

For the record, HOP and TUMBLE described in [1] are not OLAP sliding
windows; they go in the GROUP BY clause. But unlike GROUP BY, they
allow each row to contribute to more than one sub-total. This is novel
in SQL (only GROUPING SETS allows this, and in a limited form) and
could be the basis for user-defined windows.
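To make the distinction concrete, here is a minimal sketch of the intended grouping semantics, in Python rather than SQL (my own illustration, not Calcite's implementation; the function names and signatures are assumptions):

```python
# Sketch of tumbling vs. hopping window assignment: each function maps
# a row's (integer) timestamp to the list of window-start keys the row
# contributes to. TUMBLE assigns exactly one key; HOP can assign several,
# which is what makes it impossible to express with plain GROUP BY.

def tumble(ts, size):
    """TUMBLE: each row falls into exactly one window of width `size`."""
    return [ts - ts % size]

def hop(ts, slide, size):
    """HOP: windows of width `size` starting every `slide` units; a row
    can contribute to up to size/slide overlapping windows."""
    # Smallest window start (a multiple of `slide`) whose window still
    # contains `ts`.
    first = (ts - size + slide) - (ts - size + slide) % slide
    starts = []
    start = first
    while start <= ts:
        starts.append(start)
        start += slide
    return [s for s in starts if s <= ts < s + size]

# A row at t=7 with 3-unit tumbling windows belongs to one group...
assert tumble(7, 3) == [6]
# ...but with 3-wide windows sliding every 1 unit it belongs to three.
assert hop(7, 1, 3) == [5, 6, 7]
```

This is the "each row appears in more than one sub-total" behavior that only GROUPING SETS approximates in standard SQL.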

Also, our sliding windows can be defined by row count as well as by
time. For example, suppose you want to calculate the length of a
sports team's streak of consecutive wins or losses. You can partition
by N, where N is the number of state changes of the win-loss variable,
and so each switch from win-to-lose or lose-to-win starts a new
window.
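The streak example can be sketched as follows (a hedged illustration of the partition-by-state-change idea, not a proposed SQL implementation; names are assumptions):

```python
# Partition a game sequence by the running count N of win/loss state
# changes, so that each switch from win-to-lose or lose-to-win starts a
# new window; the streak length is the size of each partition.

def streaks(results):
    """Group consecutive identical results ('W'/'L') into streaks,
    returning (result, length) pairs in order."""
    changes = 0          # N: number of state changes seen so far
    groups = {}          # partition key N -> rows in that window
    prev = None
    for r in results:
        if prev is not None and r != prev:
            changes += 1
        groups.setdefault(changes, []).append(r)
        prev = r
    return [(rows[0], len(rows)) for _, rows in sorted(groups.items())]

assert streaks(list("WWWLLWLLLL")) == [('W', 3), ('L', 2), ('W', 1), ('L', 4)]
```

Note that these windows are unaligned: their boundaries are driven by the data, not by rounding a timestamp.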

As you define your extensions to SQL, I strongly suggest that you make
explicit, as columns, any data values that drive behavior. These might
include event times, arrival times, processing times, flush
directives, and any required orderings. If information is not
explicit, we cannot reason about the query as algebra, and the
planner cannot consider more radical plans, such as sorting (within a
window) or changing how the rows are partitioned across parallel
processors. A litmus test is whether a database can apply the same SQL
to the data archived from the stream and achieve the same results.
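As a toy illustration of that litmus test (assumed data; an incremental tumbling count standing in for a streaming query, plain aggregation standing in for the batch query over the archive):

```python
# The same tumbling-window count, computed two ways: incrementally as
# rows arrive (streaming), and in one pass over the archived table
# (batch). The litmus test is that the two results agree.

from collections import Counter

events = [(1, 'a'), (2, 'b'), (4, 'a'), (5, 'b'), (7, 'a')]  # (event time, payload)
SIZE = 3  # tumbling window width

# Streaming: maintain per-window counts as each row arrives.
streaming = Counter()
for ts, _ in events:
    streaming[ts - ts % SIZE] += 1

# Batch: the equivalent GROUP BY over the archived rows.
batch = Counter(ts - ts % SIZE for ts, _ in events)

assert streaming == batch == Counter({0: 2, 3: 2, 6: 1})
```

The equivalence holds here precisely because the window key is derived from an explicit column (the event time), not from hidden processing-time state.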

I saw that Fabian blogged recently[2] about stream windows in Flink.
Could we start the process by trying to convert some use cases
(expressed in English, and with sample input and output data) into
SQL? Then we can iterate to make the SQL concise, understandable, and
well-defined.

Julian

[1] http://mail-archives.apache.org/mod_mbox/calcite-dev/201506.mbox/%3CCAPSgeETbowxM2TRX0RFxQ_tEAPk2uM=hE0aryWinBtovGwbddQ@mail.gmail.com%3E

[2] https://flink.apache.org/news/2015/12/04/Introducing-windows.html

On Thu, Feb 4, 2016 at 1:35 AM, Stephan Ewen <se...@apache.org> wrote:
> Hi!
>
> True, the Flink community is looking into stream SQL, and is currently
> building on top of Calcite. This is all going well, but we probably need
> some custom syntax around windowing.
>
> For Stream SQL Windowing, what I have seen so far in Calcite (correct me if
> I am wrong there) is pretty much a variant of the OLAP sliding window
> aggregates.
>
>   - Windows there are basically calculated by rounding timestamps down/up,
> thus bucketizing the events. That works for many cases, but the syntax is
> quite tricky.
>
>   - Flink supports various notions of time for windowing (processing time,
> ingestion time, event time), as well as triggers. Being able to extend the
> window specification with such additional parameters is pretty crucial and
> would probably go well with a dedicated window clause.
>
>   - Flink also has unaligned windows (sessions, timeouts, ...) which are
> very hard to map to grouping and window aggregations across ordered groups.
>
>
> Converging to a core standard around stream SQL is very desirable, I
> completely agree.
> For the basic constructs, I think this is quite feasible and Calcite has
> some good suggestions there.
>
> On the advanced constructs, the systems currently differ quite heavily, so
> converging may be harder. Also, we are just learning what semantics people
> need concerning windowing/event time/etc. It may be a bit too early to try
> to define a standard there...
>
>
> Greetings,
> Stephan
>
>
> On Thu, Feb 4, 2016 at 9:35 AM, Julian Hyde <jh...@apache.org> wrote:
>
>> I totally agree with you. (Sorry for the delayed response; this week has
>> been very busy.)
>>
>> There is a tendency of vendors (and projects) to think that their
>> technology is unique, and superior to everyone else’s, and want to showcase
>> it in their dialect of SQL. That is natural, and it’s OK, since it makes
>> them strive to make their technology better.
>>
>> However, they have to remember that the end users don’t want something
>> unique, they want something that solves their problem. They would like
>> something that is standards compliant so that it is easy to learn, easy to
>> hire developers for, and — if the worst comes to the worst — easy to
>> migrate to a compatible competing technology.
>>
>> I know the developers at Storm and Flink (and Samza too) and they
>> understand the importance of collaborating on a standard.
>>
>> I have been trying to play a dual role: supplying the parser and planner
>> for streaming SQL, and also to facilitate the creation of a standard
>> language and semantics of streaming SQL. For the latter, see Streaming page
>> on Calcite’s web site[1]. On that page, I intend to illustrate all of the
>> main patterns of streaming queries, give them names (e.g. “Tumbling
>> windows”), and show how those translate into streaming SQL.
>>
>> Also, it would be useful to create a reference implementation of streaming
>> SQL in Calcite so that you can validate and run queries. The performance,
>> scalability and reliability will not be the same as if you ran Storm, Flink
>> or Samza, but at least you can see what the semantics should be.
>>
>> I believe that most, if not all, of the examples that the projects are
>> coming up with can be translated into SQL. It will be challenging, because
>> we want to preserve the semantics of SQL, allow streaming SQL to
>> interoperate with traditional relations, and also retain the general look
>> and feel of SQL. (For example, I fought quite hard[2] recently for the
>> principle that GROUP BY defines a partition (in the set-theory sense)[3]
>> and therefore could not be used to represent a tumbling window, until I
>> remembered that GROUPING SETS already allows each input row to appear in
>> more than one output sub-total.)
>>
>> What can you, the users, do? Get involved in the discussion about what you
>> want in the language. Encourage the projects to bring their proposed SQL
>> features into this forum for discussion, and add to the list of patterns
>> and examples on the Streaming page. As in any standards process, the users
>> help to keep the vendors focused.
>>
>> I’ll be talking about streaming SQL, planning, and standardization at the
>> Samza meetup in 2 weeks[4], so if any of you are in the Bay Area, please
>> stop by.
>>
>> Julian
>>
>> [1] http://calcite.apache.org/docs/stream.html
>>
>> [2]
>> http://mail-archives.apache.org/mod_mbox/calcite-dev/201506.mbox/%3CCAPSgeETbowxM2TRX0RFxQ_tEAPk2uM=hE0aryWinBtovGwbddQ@mail.gmail.com%3E
>>
>> [3] https://en.wikipedia.org/wiki/Partition_of_a_set
>>
>> [4] http://www.meetup.com/Bay-Area-Samza-Meetup/events/228430492/
>>
>> > On Jan 29, 2016, at 10:29 PM, Wanglan (Lan) <la...@huawei.com>
>> wrote:
>> >
>> > Hi to all,
>> >
>> > I am from Huawei, focusing on data stream processing.
>> > Recently I noticed that both the Storm and Flink communities are
>> > endeavoring to use Calcite as a SQL parser so that Storm/Flink can
>> > support SQL. Both want to supplement or clarify Calcite's streaming
>> > SQL, especially the definition of windows.
>> > My concern is that if the two communities design Stream SQL syntax
>> > separately, two different syntaxes will emerge for the same use case.
>> > Therefore, I am wondering whether it is possible to unify this work,
>> > i.e. design and complement Calcite's streaming SQL to enrich the
>> > window definitions, so that both Storm and Flink can reuse Calcite
>> > (streaming SQL) as their SQL parser for streaming cases with little
>> > change.
>> > What do you think about this idea?
>> >
>>
>>

Re: About Stream SQL

Posted by Stephan Ewen <se...@apache.org>.
Hi!

True, the Flink community is looking into stream SQL, and is currently
building on top of Calcite. This is all going well, but we probably need
some custom syntax around windowing.

For Stream SQL Windowing, what I have seen so far in Calcite (correct me if
I am wrong there) is pretty much a variant of the OLAP sliding window
aggregates.

  - Windows there are basically calculated by rounding timestamps down/up,
thus bucketizing the events. That works for many cases, but the syntax is
quite tricky.

  - Flink supports various notions of time for windowing (processing time,
ingestion time, event time), as well as triggers. Being able to extend the
window specification with such additional parameters is pretty crucial and
would probably go well with a dedicated window clause.

  - Flink also has unaligned windows (sessions, timeouts, ...) which are
very hard to map to grouping and window aggregations across ordered groups.


Converging to a core standard around stream SQL is very desirable, I
completely agree.
For the basic constructs, I think this is quite feasible and Calcite has
some good suggestions there.

On the advanced constructs, the systems currently differ quite heavily, so
converging may be harder. Also, we are just learning what semantics people
need concerning windowing/event time/etc. It may be a bit too early to try
to define a standard there...


Greetings,
Stephan


On Thu, Feb 4, 2016 at 9:35 AM, Julian Hyde <jh...@apache.org> wrote:

> I totally agree with you. (Sorry for the delayed response; this week has
> been very busy.)
>
> There is a tendency of vendors (and projects) to think that their
> technology is unique, and superior to everyone else’s, and want to showcase
> it in their dialect of SQL. That is natural, and it’s OK, since it makes
> them strive to make their technology better.
>
> However, they have to remember that the end users don’t want something
> unique, they want something that solves their problem. They would like
> something that is standards compliant so that it is easy to learn, easy to
> hire developers for, and — if the worst comes to the worst — easy to
> migrate to a compatible competing technology.
>
> I know the developers at Storm and Flink (and Samza too) and they
> understand the importance of collaborating on a standard.
>
> I have been trying to play a dual role: supplying the parser and planner
> for streaming SQL, and also to facilitate the creation of a standard
> language and semantics of streaming SQL. For the latter, see Streaming page
> on Calcite’s web site[1]. On that page, I intend to illustrate all of the
> main patterns of streaming queries, give them names (e.g. “Tumbling
> windows”), and show how those translate into streaming SQL.
>
> Also, it would be useful to create a reference implementation of streaming
> SQL in Calcite so that you can validate and run queries. The performance,
> scalability and reliability will not be the same as if you ran Storm, Flink
> or Samza, but at least you can see what the semantics should be.
>
> I believe that most, if not all, of the examples that the projects are
> coming up with can be translated into SQL. It will be challenging, because
> we want to preserve the semantics of SQL, allow streaming SQL to
> interoperate with traditional relations, and also retain the general look
> and feel of SQL. (For example, I fought quite hard[2] recently for the
> principle that GROUP BY defines a partition (in the set-theory sense)[3]
> and therefore could not be used to represent a tumbling window, until I
> remembered that GROUPING SETS already allows each input row to appear in
> more than one output sub-total.)
>
> What can you, the users, do? Get involved in the discussion about what you
> want in the language. Encourage the projects to bring their proposed SQL
> features into this forum for discussion, and add to the list of patterns
> and examples on the Streaming page. As in any standards process, the users
> help to keep the vendors focused.
>
> I’ll be talking about streaming SQL, planning, and standardization at the
> Samza meetup in 2 weeks[4], so if any of you are in the Bay Area, please
> stop by.
>
> Julian
>
> [1] http://calcite.apache.org/docs/stream.html
>
> [2]
> http://mail-archives.apache.org/mod_mbox/calcite-dev/201506.mbox/%3CCAPSgeETbowxM2TRX0RFxQ_tEAPk2uM=hE0aryWinBtovGwbddQ@mail.gmail.com%3E
>
> [3] https://en.wikipedia.org/wiki/Partition_of_a_set
>
> [4] http://www.meetup.com/Bay-Area-Samza-Meetup/events/228430492/
>
> > On Jan 29, 2016, at 10:29 PM, Wanglan (Lan) <la...@huawei.com>
> wrote:
> >
> > Hi to all,
> >
> > I am from Huawei, focusing on data stream processing.
> > Recently I noticed that both the Storm and Flink communities are
> > endeavoring to use Calcite as a SQL parser so that Storm/Flink can
> > support SQL. Both want to supplement or clarify Calcite's streaming
> > SQL, especially the definition of windows.
> > My concern is that if the two communities design Stream SQL syntax
> > separately, two different syntaxes will emerge for the same use case.
> > Therefore, I am wondering whether it is possible to unify this work,
> > i.e. design and complement Calcite's streaming SQL to enrich the
> > window definitions, so that both Storm and Flink can reuse Calcite
> > (streaming SQL) as their SQL parser for streaming cases with little
> > change.
> > What do you think about this idea?
> >
>
>