Posted to dev@flink.apache.org by Tyler Akidau <ta...@google.com.INVALID> on 2017/04/21 00:57:42 UTC

Towards a spec for robust streaming SQL, Part 1

Hello Beam, Calcite, and Flink dev lists!

Apologies for the big cross post, but I thought this might be something all
three communities would find relevant.

Beam is finally making progress on a SQL DSL utilizing Calcite, thanks to
Mingmin Xu. As you can imagine, we need to come to some conclusion about
how to elegantly support the full suite of streaming functionality in the
Beam model via Calcite SQL. You folks in the Flink community have been
pushing on this (e.g., adding windowing constructs, amongst others, thank
you! :-), but from my understanding we still don't have a full spec for how
to support robust streaming in SQL (including but not limited to, e.g., a
triggers analogue such as EMIT).

I've been spending a lot of time thinking about this and have some opinions
about how I think it should look that I've already written down, so I
volunteered to try to drive forward agreement on a general streaming SQL
spec between our three communities (well, technically I volunteered to do
that w/ Beam and Calcite, but I figured you Flink folks might want to join
in since you're going that direction already anyway and will have useful
insights :-).

My plan was to do this by sharing two docs:

   1. The Beam Model : Streams & Tables - This one is for context, and
   really only mentions SQL in passing. But it describes the relationship
   between the Beam Model and the "streams & tables" way of thinking, which
   turns out to be useful in understanding what robust streaming in SQL might
   look like. Many of you probably already know some or all of what's in here,
   but I felt it was necessary to have it all written down in order to justify
   some of the proposals I wanted to make in the second doc.

   2. A streaming SQL spec for Calcite - The goal for this doc is that it
   would become a general specification for what robust streaming SQL in
   Calcite should look like. It would start out as a basic proposal of what
   things *could* look like (combining both what things look like now as well
   as a set of proposed changes for the future), and we could all iterate on
   it together until we get to something we're happy with.

At this point, I have doc #1 ready, and it's a bit of a monster, so I
figured I'd share it and let folks hack at it with comments if they have
any, while I try to get the second doc ready in the meantime. As part of
getting doc #2 ready, I'll be starting a separate thread to try to gather
input on what things are already in flight for streaming SQL across the
various communities, to make sure the proposal captures everything that's
going on as accurately as it can.

If you have any questions or comments, I'm interested to hear them.
Otherwise, here's doc #1, "The Beam Model : Streams & Tables":

  http://s.apache.org/beam-streams-tables

-Tyler

Re: Towards a spec for robust streaming SQL, Part 1

Posted by Shaoxuan Wang <ws...@gmail.com>.
Tyler,
Yes, dynamic table changes over time. You can find more details about
dynamic table from this Flink blog (
https://flink.apache.org/news/2017/04/04/dynamic-tables.html). Fabian,
Xiaowei, and I posted it a week before Flink Forward SF.
"A dynamic table is a table that is continuously updated and can be queried
like a regular, static table. However, in contrast to a query on a batch
table which terminates and returns a static table as result, a query on a
dynamic table runs continuously and produces a table that is continuously
updated depending on the modification on the input table. Hence, the
resulting table is a dynamic table as well."
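The continuous-query behavior quoted above can be sketched in a few lines of toy Python (a simulation for illustration only, not Flink code; the class and method names are invented):

```python
# Toy simulation of a continuous query over a dynamic table:
# SELECT user, COUNT(*) FROM clicks GROUP BY user.
# Each modification of the input table immediately updates the result table,
# so the result is itself a dynamic table.
from collections import defaultdict

class ContinuousCount:
    def __init__(self):
        self.result = defaultdict(int)  # the continuously updated result table

    def on_insert(self, user):
        self.result[user] += 1          # result evolves with each input change
        return dict(self.result)        # current snapshot of the result table

q = ContinuousCount()
q.on_insert("alice")
q.on_insert("bob")
snapshot = q.on_insert("alice")
# snapshot == {"alice": 2, "bob": 1}
```

Unlike a batch query, the "query" here never terminates; each call to `on_insert` models one modification of the input table flowing through to the result.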

Regards,
Shaoxuan


On Sat, May 13, 2017 at 1:05 AM, Tyler Akidau <ta...@google.com.invalid>
wrote:

> Being able to support an EMIT config independent of the query itself sounds
> great for compatible use cases (which should be many :-).
>
> Shaoxuan, can you please refresh my memory what a dynamic table means in
> Flink? It's basically just a state table, right? The "dynamic" part of the
> name is simply to imply it can evolve and change over time?
>
> -Tyler
>
>
> On Fri, May 12, 2017 at 1:59 AM Shaoxuan Wang <ws...@gmail.com> wrote:
>
> > Thanks to Tyler and Fabian for sharing your thoughts.
> >
> > Regarding the early/late update control in Flink: IMO, each dynamic
> > table can have an EMIT config. For the Flink Table API, this can be
> > easily implemented in different manners, case by case. For instance, in a
> > window aggregate, we could define "when to EMIT a result" via a windowConf
> > per window when we create windows. Unfortunately we do not have such
> > flexibility (as compared with the Table API) in a SQL query, so we need
> > to find a way to embed this EMIT config.
> >
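A per-window EMIT config of the kind Shaoxuan describes could behave roughly like this (purely illustrative Python; `emit_every` and the early/final labels are invented, not an existing Flink or Beam API):

```python
# Toy window aggregate with an EMIT config: fire early every N elements,
# plus a final result when the window closes (here: end of input).
def windowed_sum(values, emit_every):
    total, emitted = 0, []
    for i, v in enumerate(values, start=1):
        total += v
        if i % emit_every == 0:
            emitted.append(("early", total))   # early firing per EMIT config
    emitted.append(("final", total))           # firing when the window closes
    return emitted

results = windowed_sum([1, 2, 3, 4, 5], emit_every=2)
# results == [("early", 3), ("early", 10), ("final", 15)]
```

The point of the sketch is that the EMIT policy is a property of the aggregation (here a parameter), separate from what is being computed, which is what makes keeping it outside the query text plausible.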
> > Regards,
> > Shaoxuan
> >
> >
> > On Fri, May 12, 2017 at 4:28 PM, Fabian Hueske <fh...@gmail.com>
> wrote:
> >
> > > 2017-05-11 7:14 GMT+02:00 Tyler Akidau <ta...@google.com.invalid>:
> > >
> > > > On Tue, May 9, 2017 at 3:06 PM Fabian Hueske <fh...@gmail.com>
> > wrote:
> > > >
> > > > > Hi Tyler,
> > > > >
> > > > > thank you very much for this excellent write-up and the super nice
> > > > > visualizations!
> > > > > You are discussing a lot of the things that we have been thinking
> > about
> > > > as
> > > > > well from a different perspective.
> > > > > IMO, yours and the Flink model are pretty much aligned although we
> > use
> > > a
> > > > > different terminology (which is not yet completely established). So
> > > there
> > > > > should be room for unification ;-)
> > > > >
> > > >
> > > > Good to hear, thanks for giving it a look. :-)
> > > >
> > > >
> > > > > Allow me a few words on the current state in Flink. In the upcoming
> > > 1.3.0
> > > > > release, we will have support for group window (TUMBLE, HOP,
> > SESSION),
> > > > OVER
> > > > > RANGE/ROW window (without FOLLOWING), and non-windowed GROUP BY
> > > > > aggregations. The group windows are triggered by watermark and the
> > over
> > > > > window and non-windowed aggregations emit for each input record
> > > > > (AtCount(1)). The window aggregations do not support early or late
> > > firing
> > > > > (late records are dropped), so no updates here. However, the
> > > non-windowed
> > > > > aggregates produce updates (in acc and acc/retract mode). Based on
> > this
> > > > we
> > > > > will work on better control for late updates and early firing as
> well
> > > as
> > > > > joins in the next release.
> > > > >
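The acc/retract mode Fabian mentions can be illustrated with a small sketch (a toy model of retraction semantics, not Flink's actual implementation; the boolean flag mirrors the `(Boolean, Row)` retract stream shown later in the thread):

```python
# Retract-mode updates for a non-windowed COUNT(*) ... GROUP BY key:
# each refinement first retracts (False, old_row), then accumulates
# (True, new_row), so downstream consumers can maintain a correct view.
from collections import defaultdict

def retract_stream(keys):
    counts = defaultdict(int)
    out = []
    for k in keys:
        if counts[k] > 0:
            out.append((False, (k, counts[k])))  # retract the previous result
        counts[k] += 1
        out.append((True, (k, counts[k])))       # accumulate the new result
    return out

updates = retract_stream(["a", "a"])
# updates == [(True, ("a", 1)), (False, ("a", 1)), (True, ("a", 2))]
```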
> > > >
> > > > Impressive. At this rate there's a good chance we'll just be doing
> > > catchup
> > > > and thanking you for building everything. ;-) Do you have ideas for
> > what
> > > > you want your early/late updates control to look like? That's one of
> > the
> > > > areas I'd like to see better defined for us. And how deep are you
> going
> > > > with joins?
> > > >
> > >
> > > Right now (well actually I merged the change 1h ago) we are using a
> > > QueryConfig object to specify state retention intervals to be able to
> > clean
> > > up state for inactive keys.
> > > Running a query looks like this:
> > >
> > > // -------
> > > val env = StreamExecutionEnvironment.getExecutionEnvironment
> > > val tEnv = TableEnvironment.getTableEnvironment(env)
> > >
> > > // state of inactive keys is kept for 12 hours
> > > val qConf = tEnv.queryConfig.withIdleStateRetentionTime(Time.hours(12))
> > >
> > > val t: Table = tEnv.sql("SELECT a, COUNT(*) AS cnt FROM MyTable GROUP BY a")
> > >
> > > // boolean flag for acc/retract
> > > val stream: DataStream[(Boolean, Row)] = t.toRetractStream[Row](qConf)
> > >
> > > env.execute()
> > > // -------
> > >
> > > We plan to use the QueryConfig also to specify early/late updates. Our
> > > main motivation is to have a uniform and standard SQL for batch and
> > > streaming. Hence, we have to move the configuration out of the query.
> > > But I agree that it would be very nice to be able to include it in the
> > > query. I think it should not be difficult to optionally support an EMIT
> > > clause as well.
> > >
> > >
> > > >
> > > > > Reading the document, I did not find any major difference in our
> > > > concepts.
> > > > > In fact, we are aiming to support the cases you are describing as
> > well.
> > > > > I have a question though. Would you classify an OVER aggregation
> as a
> > > > > stream -> stream or stream -> table operation? It collects records
> to
> > > > > aggregate them, but emits one record for each input row. Depending
> on
> > > the
> > > > > window definition (i.e., with FOLLOWING CURRENT ROW), you can
> compute
> > > and
> > > > > emit the result record when the input record is received.
> > > > >
> > > >
> > > > I would call it a composite stream → stream operation (since SQL,
> like
> > > the
> > > > standard Beam/Flink APIs, is a higher level set of constructs than
> raw
> > > > streams/tables operations) consisting of a stream → table windowed
> > > grouping
> > > > followed by a table → stream triggering on every element, basically
> as
> > > you
> > > > described in the previous paragraph.
> > > >
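That composite reading (a stream -> table windowed grouping whose trigger fires on every element, so the whole is effectively stream -> stream) can be sketched in toy Python (not Beam or Flink code):

```python
# OVER-style running aggregate: a state table grouped per key
# (stream -> table), with a trigger that emits one result per input
# element (table -> stream), yielding one output row per input row.
from collections import defaultdict

def over_count(stream):
    table = defaultdict(int)            # stream -> table: grouping state
    out = []                            # table -> stream: emitted results
    for key in stream:
        table[key] += 1                 # update the state table
        out.append((key, table[key]))   # trigger fires on every element
    return out

emitted = over_count(["a", "b", "a"])
# emitted == [("a", 1), ("b", 1), ("a", 2)]
```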
> > > >
> > > That makes sense. Thanks :-)
> > >
> > >
> > > > -Tyler
> > > >
> > > >
> > > > >
> > > > > I'm looking forward to the second part.
> > > > >
> > > > > Cheers, Fabian
> > > > >
> > > > >
> > > > >
> > > > > 2017-05-09 0:34 GMT+02:00 Tyler Akidau <takidau@google.com.invalid
> >:
> > > > >
> > > > > > Any thoughts here Fabian? I'm planning to start sending out some
> > more
> > > > > > emails towards the end of the week.
> > > > > >
> > > > > > -Tyler
> > > > > >
> > > > > >
> > > > > > On Wed, Apr 26, 2017 at 8:18 AM Tyler Akidau <takidau@google.com
> >
> > > > wrote:
> > > > > >
> > > > > > > No worries, thanks for the heads up. Good luck wrapping all
> that
> > > > stuff
> > > > > > up.
> > > > > > >
> > > > > > > -Tyler
> > > > > > >
> > > > > > > On Tue, Apr 25, 2017 at 12:07 AM Fabian Hueske <
> > fhueske@gmail.com>
> > > > > > wrote:
> > > > > > >
> > > > > > >> Hi Tyler,
> > > > > > >>
> > > > > > >> thanks for pushing this effort and including the Flink list.
> > > > > > >> I haven't managed to read the doc yet, but just wanted to
> thank
> > > you
> > > > > for
> > > > > > >> the
> > > > > > >> write-up and let you know that I'm very interested in this
> > > > discussion.
> > > > > > >>
> > > > > > >> We are very close to the feature freeze of Flink 1.3 and I'm
> > quite
> > > > > busy
> > > > > > >> getting as many contributions merged before the release is
> > forked
> > > > off.
> > > > > > >> Once that has happened, I'll have more time to read and comment.
> > > > > > >>
> > > > > > >> Thanks,
> > > > > > >> Fabian
> > > > > > >>
> > > > > > >>
> > > > > > >> 2017-04-22 0:16 GMT+02:00 Tyler Akidau
> > <takidau@google.com.invalid
> > > > >:
> > > > > > >>
> > > > > > >> > Good point, when you start talking about anything less than
> a
> > > full
> > > > > > join,
> > > > > > >> > triggers get involved to describe how one actually achieves
> > the
> > > > > > desired
> > > > > > >> > semantics, and they may end up being tied to just one of the
> > > > inputs
> > > > > > >> (e.g.,
> > > > > > >> > you may only care about the watermark for one side of the
> > join).
> > > > Am
> > > > > > >> > expecting us to address these sorts of details more
> precisely
> > in
> > > > doc
> > > > > > #2.
> > > > > > >> >
> > > > > > >> > -Tyler
> > > > > > >> >
> > > > > > >> > On Fri, Apr 21, 2017 at 2:26 PM Kenneth Knowles
> > > > > > <klk@google.com.invalid
> > > > > > >> >
> > > > > > >> > wrote:
> > > > > > >> >
> > > > > > >> > > There's something to be said about having different
> > triggering
> > > > > > >> depending
> > > > > > >> > on
> > > > > > >> > > which side of a join data comes from, perhaps?
> > > > > > >> > >
> > > > > > >> > > (delightful doc, as usual)
> > > > > > >> > >
> > > > > > >> > > Kenn
> > > > > > >> > >
> > > > > > >> > > On Fri, Apr 21, 2017 at 1:33 PM, Tyler Akidau
> > > > > > >> <takidau@google.com.invalid
> > > > > > >> > >
> > > > > > >> > > wrote:
> > > > > > >> > >
> > > > > > >> > > > Thanks for reading, Luke. The simple answer is that
> CoGBK
> > is
> > > > > > >> basically
> > > > > > >> > > > flatten + GBK. Flatten is a non-grouping operation that
> > > merges
> > > > > the
> > > > > > >> > input
> > > > > > >> > > > streams into a single output stream. GBK then groups the
> > > data
> > > > > > within
> > > > > > >> > that
> > > > > > >> > > > single union stream as you might otherwise expect,
> > yielding
> > > a
> > > > > > single
> > > > > > >> > > table.
> > > > > > >> > > > So I think it doesn't really impact things much.
> Grouping,
> > > > > > >> aggregation,
> > > > > > >> > > > window merging etc all just act upon the merged stream
> and
> > > > > > generate
> > > > > > >> > what
> > > > > > >> > > is
> > > > > > >> > > > effectively a merged table.
> > > > > > >> > > >
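The "flatten + GBK" reading of CoGBK described above can be sketched as follows (a toy Python model; tagging inputs by index is an assumption about how the merged stream keeps its inputs distinguishable):

```python
# CoGBK as flatten + GBK: tag each input stream, flatten them into one
# union stream (a non-grouping merge), then group that single stream by
# key, yielding one merged table.
from collections import defaultdict

def co_gbk(*streams):
    flattened = [(k, (tag, v))                  # flatten: non-grouping merge
                 for tag, s in enumerate(streams)
                 for k, v in s]
    table = defaultdict(list)                   # GBK over the union stream
    for k, tagged in flattened:
        table[k].append(tagged)
    return dict(table)

merged = co_gbk([("x", 1), ("y", 2)], [("x", 3)])
# merged == {"x": [(0, 1), (1, 3)], "y": [(0, 2)]}
```

Grouping, aggregation, and window merging then operate on the single union stream exactly as in the one-input case.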
> > > > > > >> > > > -Tyler
> > > > > > >> > > >
> > > > > > >> > > > On Fri, Apr 21, 2017 at 12:36 PM Lukasz Cwik
> > > > > > >> <lcwik@google.com.invalid
> > > > > > >> > >
> > > > > > >> > > > wrote:
> > > > > > >> > > >
> > > > > > >> > > > > The doc is a good read.
> > > > > > >> > > > >
> > > > > > >> > > > > I think you do a great job of explaining table ->
> > stream,
> > > > > stream
> > > > > > >> ->
> > > > > > >> > > > stream,
> > > > > > >> > > > > and stream -> table when there is only one stream.
> > > > > > >> > > > > But when there are multiple streams reading/writing
> to a
> > > > > table,
> > > > > > >> how
> > > > > > >> > > does
> > > > > > >> > > > > that impact what occurs?
> > > > > > >> > > > > For example, with CoGBK you have multiple streams
> > writing
> > > > to a
> > > > > > >> table,
> > > > > > >> > > how
> > > > > > >> > > > > does that impact window merging?
> > > > > > >> > > > >
> > > > > > >> > > > > On Thu, Apr 20, 2017 at 5:57 PM, Tyler Akidau
> > > > > > >> > > <takidau@google.com.invalid
> > > > > > >> > > > >
> > > > > > >> > > > > wrote:
> > > > > > >> > > > >
> > > > > > >> > > > > > Hello Beam, Calcite, and Flink dev lists!
> > > > > > >> > > > > >
> > > > > > >> > > > > > Apologies for the big cross post, but I thought this
> > > might
> > > > > be
> > > > > > >> > > something
> > > > > > >> > > > > all
> > > > > > >> > > > > > three communities would find relevant.
> > > > > > >> > > > > >
> > > > > > >> > > > > > Beam is finally making progress on a SQL DSL
> utilizing
> > > > > > Calcite,
> > > > > > >> > > thanks
> > > > > > >> > > > to
> > > > > > >> > > > > > Mingmin Xu. As you can imagine, we need to come to
> > some
> > > > > > >> conclusion
> > > > > > >> > > > about
> > > > > > >> > > > > > how to elegantly support the full suite of streaming
> > > > > > >> functionality
> > > > > > >> > in
> > > > > > >> > > > the
> > > > > Beam model via Calcite SQL. You folks in the
> Flink
> > > > > > community
> > > > > > >> > have
> > > > > > >> > > > been
> > > > > > >> > > > > > pushing on this (e.g., adding windowing constructs,
> > > > amongst
> > > > > > >> others,
> > > > > > >> > > > thank
> > > > > > >> > > > > > you! :-), but from my understanding we still don't
> > have
> > > a
> > > > > full
> > > > > > >> spec
> > > > > > >> > > for
> > > > > > >> > > > > how
> > > > > > >> > > > > > to support robust streaming in SQL (including but
> not
> > > > > limited
> > > > > > >> to,
> > > > > > >> > > > e.g., a
> > > > > > >> > > > > > triggers analogue such as EMIT).
> > > > > > >> > > > > >
> > > > > > >> > > > > > I've been spending a lot of time thinking about this
> > and
> > > > > have
> > > > > > >> some
> > > > > > >> > > > > opinions
> > > > > > >> > > > > > about how I think it should look that I've already
> > > written
> > > > > > down,
> > > > > > >> > so I
> > > > > > >> > > > > > volunteered to try to drive forward agreement on a
> > > general
> > > > > > >> > streaming
> > > > > > >> > > > SQL
> > > > > > >> > > > > > spec between our three communities (well,
> technically
> > I
> > > > > > >> volunteered
> > > > > > >> > > to
> > > > > > >> > > > do
> > > > > > >> > > > > > that w/ Beam and Calcite, but I figured you Flink
> > folks
> > > > > might
> > > > > > >> want
> > > > > > >> > to
> > > > > > >> > > > > join
> > > > > > >> > > > > > in since you're going that direction already anyway
> > and
> > > > will
> > > > > > >> have
> > > > > > >> > > > useful
> > > > > > >> > > > > > insights :-).
> > > > > > >> > > > > >
> > > > > > >> > > > > > My plan was to do this by sharing two docs:
> > > > > > >> > > > > >
> > > > > > >> > > > > >    1. The Beam Model : Streams & Tables - This one
> is
> > > for
> > > > > > >> context,
> > > > > > >> > > and
> > > > > > >> > > > > >    really only mentions SQL in passing. But it
> > describes
> > > > the
> > > > > > >> > > > relationship
> > > > > > >> > > > > >    between the Beam Model and the "streams & tables"
> > way
> > > > of
> > > > > > >> > thinking,
> > > > > > >> > > > > which
> > > > > > >> > > > > >    turns out to be useful in understanding what
> robust
> > > > > > >> streaming in
> > > > > > >> > > SQL
> > > > > > >> > > > > > might
> > > > > > >> > > > > >    look like. Many of you probably already know some
> > or
> > > > all
> > > > > of
> > > > > > >> > what's
> > > > > > >> > > > in
> > > > > > >> > > > > > here,
> > > > > > >> > > > > >    but I felt it was necessary to have it all
> written
> > > down
> > > > > in
> > > > > > >> order
> > > > > > >> > > to
> > > > > > >> > > > > > justify
> > > > > > >> > > > > >    some of the proposals I wanted to make in the
> > second
> > > > doc.
> > > > > > >> > > > > >
> > > > > > >> > > > > >    2. A streaming SQL spec for Calcite - The goal
> for
> > > this
> > > > > doc
> > > > > > >> is
> > > > > > >> > > that
> > > > > > >> > > > it
> > > > > > >> > > > > >    would become a general specification for what
> > robust
> > > > > > >> streaming
> > > > > > >> > SQL
> > > > > > >> > > > in
> > > > > > >> > > > > >    Calcite should look like. It would start out as a
> > > basic
> > > > > > >> proposal
> > > > > > >> > > of
> > > > > > >> > > > > what
> > > > > > >> > > > > >    things *could* look like (combining both what
> > things
> > > > look
> > > > > > >> like
> > > > > > >> > now
> > > > > > >> > > > as
> > > > > > >> > > > > > well
> > > > > > >> > > > > >    as a set of proposed changes for the future), and
> > we
> > > > > could
> > > > > > >> all
> > > > > > >> > > > iterate
> > > > > > >> > > > > > on
> > > > > > >> > > > > >    it together until we get to something we're happy
> > > with.
> > > > > > >> > > > > >
> > > > > > >> > > > > > At this point, I have doc #1 ready, and it's a bit
> of
> > a
> > > > > > monster,
> > > > > > >> > so I
> > > > > > >> > > > > > figured I'd share it and let folks hack at it with
> > > > comments
> > > > > if
> > > > > > >> they
> > > > > > >> > > > have
> > > > > > >> > > > > > any, while I try to get the second doc ready in the
> > > > > meantime.
> > > > > > As
> > > > > > >> > part
> > > > > > >> > > > of
> > > > > > >> > > > > > getting doc #2 ready, I'll be starting a separate
> > thread
> > > > to
> > > > > > try
> > > > > > >> to
> > > > > > >> > > > gather
> > > > > > >> > > > > > input on what things are already in flight for
> > streaming
> > > > SQL
> > > > > > >> across
> > > > > > >> > > the
> > > > > > >> > > > > > various communities, to make sure the proposal
> > captures
> > > > > > >> everything
> > > > > > >> > > > that's
> > > > > > >> > > > > > going on as accurately as it can.
> > > > > > >> > > > > >
> > > > > > >> > > > > > If you have any questions or comments, I'm
> interested
> > to
> > > > > hear
> > > > > > >> them.
> > > > > > >> > > > > > Otherwise, here's doc #1, "The Beam Model : Streams
> &
> > > > > Tables":
> > > > > > >> > > > > >
> > > > > > >> > > > > >   http://s.apache.org/beam-streams-tables
> > > > > > >> > > > > >
> > > > > > >> > > > > > -Tyler
> > > > > > >> > > > > >
> > > > > > >> > > > >
> > > > > > >> > > >
> > > > > > >> > >
> > > > > > >> >
> > > > > > >>
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: Towards a spec for robust streaming SQL, Part 1

Posted by Shaoxuan Wang <ws...@gmail.com>.
Tyler,
Yes, dynamic table changes over time. You can find more details about
dynamic table from this Flink blog (
https://flink.apache.org/news/2017/04/04/dynamic-tables.html). Fabian, me
and Xiaowei posted it a week before the flink-forward@SF.
"A dynamic table is a table that is continuously updated and can be queried
like a regular, static table. However, in contrast to a query on a batch
table which terminates and returns a static table as result, a query on a
dynamic table runs continuously and produces a table that is continuously
updated depending on the modification on the input table. Hence, the
resulting table is a dynamic table as well."

Regards,
Shaoxuan


On Sat, May 13, 2017 at 1:05 AM, Tyler Akidau <ta...@google.com.invalid>
wrote:

> Being able to support an EMIT config independent of the query itself sounds
> great for compatible use cases (which should be many :-).
>
> Shaoxuan, can you please refresh my memory what a dynamic table means in
> Flink? It's basically just a state table, right? The "dynamic" part of the
> name is to simply to imply it can evolve and change over time?
>
> -Tyler
>
>
> On Fri, May 12, 2017 at 1:59 AM Shaoxuan Wang <ws...@gmail.com> wrote:
>
> > Thanks to Tyler and Fabian for sharing your thoughts.
> >
> > Regarding to the early/late update control of FLINK. IMO, each dynamic
> > table can have an EMIT config. For FLINK table-API, this can be easily
> > implemented in different manners, case by case. For instance, in window
> > aggregate, we could define "when EMIT a result" via a windowConf per each
> > window when we create windows. Unfortunately we do not have such
> > flexibility (as compared with TableAPI) in SQL query, we need find a way
> to
> > embed this EMIT config.
> >
> > Regards,
> > Shaoxuan
> >
> >
> > On Fri, May 12, 2017 at 4:28 PM, Fabian Hueske <fh...@gmail.com>
> wrote:
> >
> > > 2017-05-11 7:14 GMT+02:00 Tyler Akidau <ta...@google.com.invalid>:
> > >
> > > > On Tue, May 9, 2017 at 3:06 PM Fabian Hueske <fh...@gmail.com>
> > wrote:
> > > >
> > > > > Hi Tyler,
> > > > >
> > > > > thank you very much for this excellent write-up and the super nice
> > > > > visualizations!
> > > > > You are discussing a lot of the things that we have been thinking
> > about
> > > > as
> > > > > well from a different perspective.
> > > > > IMO, yours and the Flink model are pretty much aligned although we
> > use
> > > a
> > > > > different terminology (which is not yet completely established). So
> > > there
> > > > > should be room for unification ;-)
> > > > >
> > > >
> > > > Good to hear, thanks for giving it a look. :-)
> > > >
> > > >
> > > > > Allow me a few words on the current state in Flink. In the upcoming
> > > 1.3.0
> > > > > release, we will have support for group window (TUMBLE, HOP,
> > SESSION),
> > > > OVER
> > > > > RANGE/ROW window (without FOLLOWING), and non-windowed GROUP BY
> > > > > aggregations. The group windows are triggered by watermark and the
> > over
> > > > > window and non-windowed aggregations emit for each input record
> > > > > (AtCount(1)). The window aggregations do not support early or late
> > > firing
> > > > > (late records are dropped), so no updates here. However, the
> > > non-windowed
> > > > > aggregates produce updates (in acc and acc/retract mode). Based on
> > this
> > > > we
> > > > > will work on better control for late updates and early firing as
> well
> > > as
> > > > > joins in the next release.
> > > > >
> > > >
> > > > Impressive. At this rate there's a good chance we'll just be doing
> > > catchup
> > > > and thanking you for building everything. ;-) Do you have ideas for
> > what
> > > > you want your early/late updates control to look like? That's one of
> > the
> > > > areas I'd like to see better defined for us. And how deep are you
> going
> > > > with joins?
> > > >
> > >
> > > Right now (well actually I merged the change 1h ago) we are using a
> > > QueryConfig object to specify state retention intervals to be able to
> > clean
> > > up state for inactive keys.
> > > A running a query looks like this:
> > >
> > > // -------
> > > val env = StreamExecutionEnvironment.getExecutionEnvironment
> > > val tEnv = TableEnvironment.getTableEnvironment(env)
> > >
> > > val qConf = tEnv.queryConfig.withIdleStateRetentionTime(
> Time.hours(12))
> > //
> > > state of inactive keys is kept for 12 hours
> > >
> > > val t: Table = tEnv.sql("SELECT a, COUNT(*) AS cnt FROM MyTable GROUP
> BY
> > > a")
> > > val stream: DataStream[(Boolean, Row)] = t.toRetractStream[Row](qConf)
> //
> > > boolean flag for acc/retract
> > >
> > > env.execute()
> > > // -------
> > >
> > > We plan to use the QueryConfig also to specify early/late updates. Our
> > main
> > > motivation is to have a uniform and standard SQL for batch and
> streaming.
> > > Hence, we have to move the configuration out of the query. But I agree,
> > > that it would be very nice to be able to include it in the query. I
> think
> > > it should not be difficult to optionally support an EMIT clause as
> well.
> > >
> > >
> > > >
> > > > > Reading the document, I did not find any major difference in our
> > > > concepts.
> > > > > In fact, we are aiming to support the cases you are describing as
> > well.
> > > > > I have a question though. Would you classify an OVER aggregation
> as a
> > > > > stream -> stream or stream -> table operation? It collects records
> to
> > > > > aggregate them, but emits one record for each input row. Depending
> on
> > > the
> > > > > window definition (i.e., with FOLLOWING CURRENT ROW), you can
> compute
> > > and
> > > > > emit the result record when the input record is received.
> > > > >
> > > >
> > > > I would call it a composite stream → stream operation (since SQL,
> like
> > > the
> > > > standard Beam/Flink APIs, is a higher level set of constructs than
> raw
> > > > streams/tables operations) consisting of a stream → table windowed
> > > grouping
> > > > followed by a table → stream triggering on every element, basically
> as
> > > you
> > > > described in the previous paragraph.
> > > >
> > > >
> > > That makes sense. Thanks :-)
> > >
> > >
> > > > -Tyler
> > > >
> > > >
> > > > >
> > > > > I'm looking forward to the second part.
> > > > >
> > > > > Cheers, Fabian
> > > > >
> > > > >
> > > > >
> > > > > 2017-05-09 0:34 GMT+02:00 Tyler Akidau <takidau@google.com.invalid
> >:
> > > > >
> > > > > > Any thoughts here Fabian? I'm planning to start sending out some
> > more
> > > > > > emails towards the end of the week.
> > > > > >
> > > > > > -Tyler
> > > > > >
> > > > > >
> > > > > > On Wed, Apr 26, 2017 at 8:18 AM Tyler Akidau <takidau@google.com
> >
> > > > wrote:
> > > > > >
> > > > > > > No worries, thanks for the heads up. Good luck wrapping all
> that
> > > > stuff
> > > > > > up.
> > > > > > >
> > > > > > > -Tyler
> > > > > > >
> > > > > > > On Tue, Apr 25, 2017 at 12:07 AM Fabian Hueske <
> > fhueske@gmail.com>
> > > > > > wrote:
> > > > > > >
> > > > > > >> Hi Tyler,
> > > > > > >>
> > > > > > >> thanks for pushing this effort and including the Flink list.
> > > > > > >> I haven't managed to read the doc yet, but just wanted to
> thank
> > > you
> > > > > for
> > > > > > >> the
> > > > > > >> write-up and let you know that I'm very interested in this
> > > > discussion.
> > > > > > >>
> > > > > > >> We are very close to the feature freeze of Flink 1.3 and I'm
> > quite
> > > > > busy
> > > > > > >> getting as many contributions merged before the release is
> > forked
> > > > off.
> > > > > > >> When that happened, I'll have more time to read and comment.
> > > > > > >>
> > > > > > >> Thanks,
> > > > > > >> Fabian
> > > > > > >>
> > > > > > >>
> > > > > > >> 2017-04-22 0:16 GMT+02:00 Tyler Akidau
> > <takidau@google.com.invalid
> > > > >:
> > > > > > >>
> > > > > > >> > Good point, when you start talking about anything less than
> a
> > > full
> > > > > > join,
> > > > > > >> > triggers get involved to describe how one actually achieves
> > the
> > > > > > desired
> > > > > > >> > semantics, and they may end up being tied to just one of the
> > > > inputs
> > > > > > >> (e.g.,
> > > > > > >> > you may only care about the watermark for one side of the
> > join).
> > > > Am
> > > > > > >> > expecting us to address these sorts of details more
> precisely
> > in
> > > > doc
> > > > > > #2.
> > > > > > >> >
> > > > > > >> > -Tyler
> > > > > > >> >
> > > > > > >> > On Fri, Apr 21, 2017 at 2:26 PM Kenneth Knowles
> > > > > > <klk@google.com.invalid
> > > > > > >> >
> > > > > > >> > wrote:
> > > > > > >> >
> > > > > > >> > > There's something to be said about having different
> > triggering
> > > > > > >> depending
> > > > > > >> > on
> > > > > > >> > > which side of a join data comes from, perhaps?
> > > > > > >> > >
> > > > > > >> > > (delightful doc, as usual)
> > > > > > >> > >
> > > > > > >> > > Kenn
> > > > > > >> > >
> > > > > > >> > > On Fri, Apr 21, 2017 at 1:33 PM, Tyler Akidau
> > > > > > >> <takidau@google.com.invalid
> > > > > > >> > >
> > > > > > >> > > wrote:
> > > > > > >> > >
> > > > > > >> > > > Thanks for reading, Luke. The simple answer is that
> CoGBK
> > is
> > > > > > >> basically
> > > > > > >> > > > flatten + GBK. Flatten is a non-grouping operation that
> > > merges
> > > > > the
> > > > > > >> > input
> > > > > > >> > > > streams into a single output stream. GBK then groups the
> > > data
> > > > > > within
> > > > > > >> > that
> > > > > > >> > > > single union stream as you might otherwise expect,
> > yielding
> > > a
> > > > > > single
> > > > > > >> > > table.
> > > > > > >> > > > So I think it doesn't really impact things much.
> Grouping,
> > > > > > >> aggregation,
> > > > > > >> > > > window merging etc all just act upon the merged stream
> and
> > > > > > generate
> > > > > > >> > what
> > > > > > >> > > is
> > > > > > >> > > > effectively a merged table.
> > > > > > >> > > >
> > > > > > >> > > > -Tyler
> > > > > > >> > > >
> > > > > > >> > > > On Fri, Apr 21, 2017 at 12:36 PM Lukasz Cwik
> > > > > > >> <lcwik@google.com.invalid
> > > > > > >> > >
> > > > > > >> > > > wrote:
> > > > > > >> > > >
> > > > > > >> > > > > The doc is a good read.
> > > > > > >> > > > >
> > > > > > >> > > > > I think you do a great job of explaining table ->
> > stream,
> > > > > stream
> > > > > > >> ->
> > > > > > >> > > > stream,
> > > > > > >> > > > > and stream -> table when there is only one stream.
> > > > > > >> > > > > But when there are multiple streams reading/writing
> to a
> > > > > table,
> > > > > > >> how
> > > > > > >> > > does
> > > > > > >> > > > > that impact what occurs?
> > > > > > >> > > > > For example, with CoGBK you have multiple streams
> > writing
> > > > to a
> > > > > > >> table,
> > > > > > >> > > how
> > > > > > >> > > > > does that impact window merging?
> > > > > > >> > > > >
> > > > > > >> > > > > On Thu, Apr 20, 2017 at 5:57 PM, Tyler Akidau
> > > > > > >> > > <takidau@google.com.invalid
> > > > > > >> > > > >
> > > > > > >> > > > > wrote:
> > > > > > >> > > > >
> > > > > > >> > > > > > Hello Beam, Calcite, and Flink dev lists!
> > > > > > >> > > > > >
> > > > > > >> > > > > > Apologies for the big cross post, but I thought this might be
> > > > > > >> > > > > > something all three communities would find relevant.
> > > > > > >> > > > > >
> > > > > > >> > > > > > Beam is finally making progress on a SQL DSL utilizing Calcite,
> > > > > > >> > > > > > thanks to Mingmin Xu. As you can imagine, we need to come to some
> > > > > > >> > > > > > conclusion about how to elegantly support the full suite of
> > > > > > >> > > > > > streaming functionality in the Beam model via Calcite SQL. You
> > > > > > >> > > > > > folks in the Flink community have been pushing on this (e.g.,
> > > > > > >> > > > > > adding windowing constructs, amongst others, thank you! :-), but
> > > > > > >> > > > > > from my understanding we still don't have a full spec for how to
> > > > > > >> > > > > > support robust streaming in SQL (including but not limited to,
> > > > > > >> > > > > > e.g., a triggers analogue such as EMIT).
> > > > > > >> > > > > >
> > > > > > >> > > > > > I've been spending a lot of time thinking about this and have some
> > > > > > >> > > > > > opinions about how I think it should look that I've already
> > > > > > >> > > > > > written down, so I volunteered to try to drive forward agreement
> > > > > > >> > > > > > on a general streaming SQL spec between our three communities
> > > > > > >> > > > > > (well, technically I volunteered to do that w/ Beam and Calcite,
> > > > > > >> > > > > > but I figured you Flink folks might want to join in since you're
> > > > > > >> > > > > > going that direction already anyway and will have useful
> > > > > > >> > > > > > insights :-).
> > > > > > >> > > > > >
> > > > > > >> > > > > > My plan was to do this by sharing two docs:
> > > > > > >> > > > > >
> > > > > > >> > > > > >    1. The Beam Model : Streams & Tables - This one is for
> > > > > > >> > > > > >    context, and really only mentions SQL in passing. But it
> > > > > > >> > > > > >    describes the relationship between the Beam Model and the
> > > > > > >> > > > > >    "streams & tables" way of thinking, which turns out to be
> > > > > > >> > > > > >    useful in understanding what robust streaming in SQL might
> > > > > > >> > > > > >    look like. Many of you probably already know some or all of
> > > > > > >> > > > > >    what's in here, but I felt it was necessary to have it all
> > > > > > >> > > > > >    written down in order to justify some of the proposals I
> > > > > > >> > > > > >    wanted to make in the second doc.
> > > > > > >> > > > > >
> > > > > > >> > > > > >    2. A streaming SQL spec for Calcite - The goal for this doc is
> > > > > > >> > > > > >    that it would become a general specification for what robust
> > > > > > >> > > > > >    streaming SQL in Calcite should look like. It would start out
> > > > > > >> > > > > >    as a basic proposal of what things *could* look like
> > > > > > >> > > > > >    (combining both what things look like now as well as a set of
> > > > > > >> > > > > >    proposed changes for the future), and we could all iterate on
> > > > > > >> > > > > >    it together until we get to something we're happy with.
> > > > > > >> > > > > >
> > > > > > >> > > > > > At this point, I have doc #1 ready, and it's a bit of a monster,
> > > > > > >> > > > > > so I figured I'd share it and let folks hack at it with comments
> > > > > > >> > > > > > if they have any, while I try to get the second doc ready in the
> > > > > > >> > > > > > meantime. As part of getting doc #2 ready, I'll be starting a
> > > > > > >> > > > > > separate thread to try to gather input on what things are already
> > > > > > >> > > > > > in flight for streaming SQL across the various communities, to
> > > > > > >> > > > > > make sure the proposal captures everything that's going on as
> > > > > > >> > > > > > accurately as it can.
> > > > > > >> > > > > >
> > > > > > >> > > > > > If you have any questions or comments, I'm interested to hear
> > > > > > >> > > > > > them. Otherwise, here's doc #1, "The Beam Model : Streams &
> > > > > > >> > > > > > Tables":
> > > > > > >> > > > > >
> > > > > > >> > > > > >   http://s.apache.org/beam-streams-tables
> > > > > > >> > > > > >
> > > > > > >> > > > > > -Tyler
> > > > > > >> > > > > >
> > > > > > >> > > > >
> > > > > > >> > > >
> > > > > > >> > >
> > > > > > >> >
> > > > > > >>
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: Towards a spec for robust streaming SQL, Part 1

Posted by Shaoxuan Wang <ws...@gmail.com>.
Tyler,
Yes, a dynamic table changes over time. You can find more details about
dynamic tables in this Flink blog post
(https://flink.apache.org/news/2017/04/04/dynamic-tables.html). Fabian,
Xiaowei, and I posted it a week before Flink Forward SF.
"A dynamic table is a table that is continuously updated and can be queried
like a regular, static table. However, in contrast to a query on a batch
table which terminates and returns a static table as result, a query on a
dynamic table runs continuously and produces a table that is continuously
updated depending on the modification on the input table. Hence, the
resulting table is a dynamic table as well."

Regards,
Shaoxuan
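
The continuous-query semantics in that definition can be modeled in a few
lines of plain Python (an illustrative sketch of the semantics only; this is
not Flink API code): every modification of the input table produces an
updated result table, and the query never terminates.

```python
# Model of a continuous "SELECT user, COUNT(*) ... GROUP BY user" query on a
# dynamic table: each insert into the input table yields the updated result.
from collections import Counter

def continuous_count_query(inserts):
    """Yield the state of the result table after every input modification."""
    counts = Counter()
    for user in inserts:
        counts[user] += 1          # a modification of the input table...
        yield dict(counts)         # ...produces an updated result table

snapshots = list(continuous_count_query(["a", "b", "a"]))
# Each snapshot is the result table so far:
# [{'a': 1}, {'a': 1, 'b': 1}, {'a': 2, 'b': 1}]
```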


On Sat, May 13, 2017 at 1:05 AM, Tyler Akidau <ta...@google.com.invalid>
wrote:

> Being able to support an EMIT config independent of the query itself sounds
> great for compatible use cases (which should be many :-).
>
> Shaoxuan, can you please refresh my memory what a dynamic table means in
> Flink? It's basically just a state table, right? The "dynamic" part of the
> name is simply to imply it can evolve and change over time?
>
> -Tyler
>
>
> On Fri, May 12, 2017 at 1:59 AM Shaoxuan Wang <ws...@gmail.com> wrote:
>
> > Thanks to Tyler and Fabian for sharing your thoughts.
> >
> > Regarding the early/late update control in Flink: IMO, each dynamic
> > table can have an EMIT config. For the Flink Table API, this can be
> > easily implemented in different manners, case by case. For instance, in
> > a window aggregate, we could define "when to EMIT a result" via a
> > windowConf per window when we create windows. Unfortunately we do not
> > have such flexibility in a SQL query (as compared with the Table API),
> > so we need to find a way to embed this EMIT config.
> >
> > Regards,
> > Shaoxuan
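
The per-window EMIT config described above can be pictured with a small plain
Python sketch (the function and parameter names here, e.g. `windowed_sum` and
`every_n`, are invented for illustration; this is not Table API code): a
window aggregate fires early results every N elements and a final result when
the watermark passes the window end.

```python
# Sketch of an EMIT config attached to a window: early firing every
# `every_n` elements, final firing once the watermark covers the window.
def windowed_sum(events, window_end, watermark_after, every_n):
    """events: list of (timestamp, value); returns list of (label, sum)."""
    emitted, total, seen = [], 0, 0
    for ts, value in events:
        total += value
        seen += 1
        if seen % every_n == 0:        # early firing per the EMIT config
            emitted.append(("early", total))
    if watermark_after >= window_end:  # on-time firing at the watermark
        emitted.append(("final", total))
    return emitted

print(windowed_sum([(1, 10), (2, 20), (3, 5)], window_end=5,
                   watermark_after=6, every_n=2))
# [('early', 30), ('final', 35)]
```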
> >
> >
> > On Fri, May 12, 2017 at 4:28 PM, Fabian Hueske <fh...@gmail.com>
> wrote:
> >
> > > 2017-05-11 7:14 GMT+02:00 Tyler Akidau <ta...@google.com.invalid>:
> > >
> > > > On Tue, May 9, 2017 at 3:06 PM Fabian Hueske <fh...@gmail.com>
> > wrote:
> > > >
> > > > > Hi Tyler,
> > > > >
> > > > > thank you very much for this excellent write-up and the super nice
> > > > > visualizations!
> > > > > You are discussing a lot of the things that we have been thinking
> > about
> > > > as
> > > > > well from a different perspective.
> > > > > IMO, yours and the Flink model are pretty much aligned although we
> > use
> > > a
> > > > > different terminology (which is not yet completely established). So
> > > there
> > > > > should be room for unification ;-)
> > > > >
> > > >
> > > > Good to hear, thanks for giving it a look. :-)
> > > >
> > > >
> > > > > Allow me a few words on the current state in Flink. In the upcoming
> > > 1.3.0
> > > > > release, we will have support for group window (TUMBLE, HOP,
> > SESSION),
> > > > OVER
> > > > > RANGE/ROW window (without FOLLOWING), and non-windowed GROUP BY
> > > > > aggregations. The group windows are triggered by watermark and the
> > over
> > > > > window and non-windowed aggregations emit for each input record
> > > > > (AtCount(1)). The window aggregations do not support early or late
> > > firing
> > > > > (late records are dropped), so no updates here. However, the
> > > non-windowed
> > > > > aggregates produce updates (in acc and acc/retract mode). Based on
> > this
> > > > we
> > > > > will work on better control for late updates and early firing as
> well
> > > as
> > > > > joins in the next release.
> > > > >
> > > >
> > > > Impressive. At this rate there's a good chance we'll just be doing
> > > catchup
> > > > and thanking you for building everything. ;-) Do you have ideas for
> > what
> > > > you want your early/late updates control to look like? That's one of
> > the
> > > > areas I'd like to see better defined for us. And how deep are you
> going
> > > > with joins?
> > > >
> > >
> > > Right now (well, actually I merged the change 1h ago) we are using a
> > > QueryConfig object to specify state retention intervals, to be able
> > > to clean up state for inactive keys.
> > > Running a query looks like this:
> > >
> > > // -------
> > > val env = StreamExecutionEnvironment.getExecutionEnvironment
> > > val tEnv = TableEnvironment.getTableEnvironment(env)
> > >
> > > // state of inactive keys is kept for 12 hours
> > > val qConf = tEnv.queryConfig.withIdleStateRetentionTime(Time.hours(12))
> > >
> > > val t: Table = tEnv.sql("SELECT a, COUNT(*) AS cnt FROM MyTable GROUP BY a")
> > >
> > > // boolean flag for acc/retract
> > > val stream: DataStream[(Boolean, Row)] = t.toRetractStream[Row](qConf)
> > >
> > > env.execute()
> > > // -------
> > >
> > > We plan to use the QueryConfig also to specify early/late updates.
> > > Our main motivation is to have a uniform and standard SQL for batch
> > > and streaming. Hence, we have to move the configuration out of the
> > > query. But I agree that it would be very nice to be able to include
> > > it in the query. I think it should not be difficult to optionally
> > > support an EMIT clause as well.
> > >
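
For readers unfamiliar with acc/retract mode, the (Boolean, Row) retract
stream discussed above can be modeled roughly in plain Python (a sketch of
the assumed semantics, not Flink's implementation): each input row retracts
the previously emitted result for its key and accumulates the new one.

```python
# Model of a retract stream for SELECT a, COUNT(*) GROUP BY a:
# flag False = retract an earlier result, flag True = accumulate a new one.
def to_retract_stream(rows):
    counts = {}
    out = []
    for a in rows:
        if a in counts:
            out.append((False, (a, counts[a])))  # retract the old count
        counts[a] = counts.get(a, 0) + 1
        out.append((True, (a, counts[a])))       # accumulate the new count
    return out

print(to_retract_stream(["x", "y", "x"]))
# [(True, ('x', 1)), (True, ('y', 1)), (False, ('x', 1)), (True, ('x', 2))]
```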
> > >
> > > >
> > > > > Reading the document, I did not find any major difference in our
> > > > concepts.
> > > > > In fact, we are aiming to support the cases you are describing as
> > well.
> > > > > I have a question though. Would you classify an OVER aggregation
> as a
> > > > > stream -> stream or stream -> table operation? It collects records
> to
> > > > > aggregate them, but emits one record for each input row. Depending
> on
> > > the
> > > > > window definition (i.e., with FOLLOWING CURRENT ROW), you can
> compute
> > > and
> > > > > emit the result record when the input record is received.
> > > > >
> > > >
> > > > I would call it a composite stream → stream operation (since SQL,
> like
> > > the
> > > > standard Beam/Flink APIs, is a higher level set of constructs than
> raw
> > > > streams/tables operations) consisting of a stream → table windowed
> > > grouping
> > > > followed by a table → stream triggering on every element, basically
> as
> > > you
> > > > described in the previous paragraph.
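
That decomposition — a stream -> table grouping followed by a table -> stream
trigger on every element — can be sketched in plain Python for a running-sum
OVER aggregation (an illustrative model only, not engine code):

```python
# OVER aggregation as stream -> table -> stream: group each input into
# per-key state, then trigger an output for every element, yielding one
# result row per input row.
def over_running_sum(rows):
    """rows: list of (key, value); emits (key, value, running_sum) per row."""
    state = {}                                   # the "table": per-key accumulator
    out = []
    for key, value in rows:
        state[key] = state.get(key, 0) + value   # stream -> table (grouping)
        out.append((key, value, state[key]))     # table -> stream (per-element trigger)
    return out

print(over_running_sum([("k", 1), ("k", 2), ("j", 5)]))
# [('k', 1, 1), ('k', 2, 3), ('j', 5, 5)]
```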
> > > >
> > > >
> > > That makes sense. Thanks :-)
> > >
> > >
> > > > -Tyler
> > > >
> > > >
> > > > >
> > > > > I'm looking forward to the second part.
> > > > >
> > > > > Cheers, Fabian
> > > > >
> > > > >
> > > > >
> > > > > 2017-05-09 0:34 GMT+02:00 Tyler Akidau <takidau@google.com.invalid
> >:
> > > > >
> > > > > > Any thoughts here Fabian? I'm planning to start sending out some
> > more
> > > > > > emails towards the end of the week.
> > > > > >
> > > > > > -Tyler
> > > > > >
> > > > > >
> > > > > > On Wed, Apr 26, 2017 at 8:18 AM Tyler Akidau <takidau@google.com
> >
> > > > wrote:
> > > > > >
> > > > > > > No worries, thanks for the heads up. Good luck wrapping all
> that
> > > > stuff
> > > > > > up.
> > > > > > >
> > > > > > > -Tyler
> > > > > > >
> > > > > > > On Tue, Apr 25, 2017 at 12:07 AM Fabian Hueske <
> > fhueske@gmail.com>
> > > > > > wrote:
> > > > > > >
> > > > > > >> Hi Tyler,
> > > > > > >>
> > > > > > >> thanks for pushing this effort and including the Flink list.
> > > > > > >> I haven't managed to read the doc yet, but just wanted to
> thank
> > > you
> > > > > for
> > > > > > >> the
> > > > > > >> write-up and let you know that I'm very interested in this
> > > > discussion.
> > > > > > >>
> > > > > > >> We are very close to the feature freeze of Flink 1.3 and I'm
> > quite
> > > > > busy
> > > > > > >> getting as many contributions merged before the release is
> > forked
> > > > off.
> > > > > > >> When that happened, I'll have more time to read and comment.
> > > > > > >>
> > > > > > >> Thanks,
> > > > > > >> Fabian
> > > > > > >>
> > > > > > >>
> > > > > > >> 2017-04-22 0:16 GMT+02:00 Tyler Akidau
> > <takidau@google.com.invalid
> > > > >:
> > > > > > >>
> > > > > > >> > Good point, when you start talking about anything less than
> a
> > > full
> > > > > > join,
> > > > > > >> > triggers get involved to describe how one actually achieves
> > the
> > > > > > desired
> > > > > > >> > semantics, and they may end up being tied to just one of the
> > > > inputs
> > > > > > >> (e.g.,
> > > > > > >> > you may only care about the watermark for one side of the
> > join).
> > > > Am
> > > > > > >> > expecting us to address these sorts of details more
> precisely
> > in
> > > > doc
> > > > > > #2.
> > > > > > >> >
> > > > > > >> > -Tyler
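
One way to picture triggering tied to a single join input is the following
speculative Python sketch (the gating policy and names are invented here,
not taken from the thread): both sides are buffered, but results are emitted
only once the watermark of one designated side covers the row's timestamp.

```python
# Inner join on key where emission is gated only by the *left* side's
# watermark; the right side joins whenever matching data exists.
def one_sided_join(left, right, left_watermark):
    """left/right: lists of (timestamp, key, value); returns joined tuples."""
    out = []
    for lt, lk, lv in left:
        if lt > left_watermark:      # left side not yet complete: hold back
            continue
        for _, rk, rv in right:
            if lk == rk:
                out.append((lk, lv, rv))
    return out

print(one_sided_join([(1, "k", "l1"), (9, "k", "l2")],
                     [(4, "k", "r1")], left_watermark=5))
# [('k', 'l1', 'r1')]
```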
> > > > > > >> >
> > > > > > >> > On Fri, Apr 21, 2017 at 2:26 PM Kenneth Knowles
> > > > > > <klk@google.com.invalid
> > > > > > >> >
> > > > > > >> > wrote:
> > > > > > >> >
> > > > > > >> > > There's something to be said about having different
> > triggering
> > > > > > >> depending
> > > > > > >> > on
> > > > > > >> > > which side of a join data comes from, perhaps?
> > > > > > >> > >
> > > > > > >> > > (delightful doc, as usual)
> > > > > > >> > >
> > > > > > >> > > Kenn
> > > > > > >> > >
> > > > > > >> > > On Fri, Apr 21, 2017 at 1:33 PM, Tyler Akidau
> > > > > > >> <takidau@google.com.invalid
> > > > > > >> > >
> > > > > > >> > > wrote:
> > > > > > >> > >
> > > > > > >> > > > Thanks for reading, Luke. The simple answer is that
> > > > > > >> > > > CoGBK is basically flatten + GBK. Flatten is a
> > > > > > >> > > > non-grouping operation that merges the input streams
> > > > > > >> > > > into a single output stream. GBK then groups the data
> > > > > > >> > > > within that single union stream as you might otherwise
> > > > > > >> > > > expect, yielding a single table. So I think it doesn't
> > > > > > >> > > > really impact things much. Grouping, aggregation,
> > > > > > >> > > > window merging, etc. all just act upon the merged
> > > > > > >> > > > stream and generate what is effectively a merged table.
> > > > > > >> > > >
> > > > > > >> > > > -Tyler
> > > > > > >> > > >
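
The flatten + GBK decomposition of CoGBK can be sketched in plain Python
(assumed semantics, not the Beam API): tag each input stream, flatten the
tagged streams into one, then group by key on the union.

```python
# CoGBK modeled as flatten + GBK over tagged input streams.
from collections import defaultdict

def co_group_by_key(**streams):
    """streams: name -> list of (key, value); returns key -> {name: [values]}."""
    flattened = [(key, (name, value))            # flatten: merge tagged streams
                 for name, kvs in streams.items()
                 for key, value in kvs]
    grouped = defaultdict(lambda: {name: [] for name in streams})
    for key, (name, value) in flattened:         # GBK on the single union stream
        grouped[key][name].append(value)
    return dict(grouped)

result = co_group_by_key(left=[("k", 1), ("j", 2)], right=[("k", "a")])
# result == {'k': {'left': [1], 'right': ['a']}, 'j': {'left': [2], 'right': []}}
```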

Re: Towards a spec for robust streaming SQL, Part 1

Posted by Tyler Akidau <ta...@google.com.INVALID>.
Being able to support an EMIT config independent of the query itself sounds
great for compatible use cases (which should be many :-).

Shaoxuan, can you please refresh my memory what a dynamic table means in
Flink? It's basically just a state table, right? The "dynamic" part of the
name is simply to imply it can evolve and change over time?

-Tyler


On Fri, May 12, 2017 at 1:59 AM Shaoxuan Wang <ws...@gmail.com> wrote:

> Thanks to Tyler and Fabian for sharing your thoughts.
>
> Regarding the early/late update control in Flink: IMO, each dynamic
> table can have an EMIT config. For the Flink Table API, this can be
> easily implemented in different manners, case by case. For instance, in
> a window aggregate, we could define "when to EMIT a result" via a
> windowConf per window when we create windows. Unfortunately we do not
> have such flexibility in a SQL query (as compared with the Table API),
> so we need to find a way to embed this EMIT config.
>
> Regards,
> Shaoxuan
>
>
> > > > > >> > > > > On Thu, Apr 20, 2017 at 5:57 PM, Tyler Akidau
> > > > > >> > > <takidau@google.com.invalid
> > > > > >> > > > >
> > > > > >> > > > > wrote:
> > > > > >> > > > >
> > > > > >> > > > > > Hello Beam, Calcite, and Flink dev lists!
> > > > > >> > > > > >
> > > > > >> > > > > > Apologies for the big cross post, but I thought this
> > might
> > > > be
> > > > > >> > > something
> > > > > >> > > > > all
> > > > > >> > > > > > three communities would find relevant.
> > > > > >> > > > > >
> > > > > >> > > > > > Beam is finally making progress on a SQL DSL utilizing
> > > > > Calcite,
> > > > > >> > > thanks
> > > > > >> > > > to
> > > > > >> > > > > > Mingmin Xu. As you can imagine, we need to come to
> some
> > > > > >> conclusion
> > > > > >> > > > about
> > > > > >> > > > > > how to elegantly support the full suite of streaming
> > > > > >> functionality
> > > > > >> > in
> > > > > >> > > > the
> > > > > >> > > > > > Beam model via Calcite SQL. You folks in the Flink
> > > > > community
> > > > > >> > have
> > > > > >> > > > been
> > > > > >> > > > > > pushing on this (e.g., adding windowing constructs,
> > > amongst
> > > > > >> others,
> > > > > >> > > > thank
> > > > > >> > > > > > you! :-), but from my understanding we still don't
> have
> > a
> > > > full
> > > > > >> spec
> > > > > >> > > for
> > > > > >> > > > > how
> > > > > >> > > > > > to support robust streaming in SQL (including but not
> > > > limited
> > > > > >> to,
> > > > > >> > > > e.g., a
> > > > > >> > > > > > triggers analogue such as EMIT).
> > > > > >> > > > > >
> > > > > >> > > > > > I've been spending a lot of time thinking about this
> and
> > > > have
> > > > > >> some
> > > > > >> > > > > opinions
> > > > > >> > > > > > about how I think it should look that I've already
> > written
> > > > > down,
> > > > > >> > so I
> > > > > >> > > > > > volunteered to try to drive forward agreement on a
> > general
> > > > > >> > streaming
> > > > > >> > > > SQL
> > > > > >> > > > > > spec between our three communities (well, technically
> I
> > > > > >> volunteered
> > > > > >> > > to
> > > > > >> > > > do
> > > > > >> > > > > > that w/ Beam and Calcite, but I figured you Flink
> folks
> > > > might
> > > > > >> want
> > > > > >> > to
> > > > > >> > > > > join
> > > > > >> > > > > > in since you're going that direction already anyway
> and
> > > will
> > > > > >> have
> > > > > >> > > > useful
> > > > > >> > > > > > insights :-).
> > > > > >> > > > > >
> > > > > >> > > > > > My plan was to do this by sharing two docs:
> > > > > >> > > > > >
> > > > > >> > > > > >    1. The Beam Model : Streams & Tables - This one is
> > for
> > > > > >> context,
> > > > > >> > > and
> > > > > >> > > > > >    really only mentions SQL in passing. But it
> describes
> > > the
> > > > > >> > > > relationship
> > > > > >> > > > > >    between the Beam Model and the "streams & tables"
> way
> > > of
> > > > > >> > thinking,
> > > > > >> > > > > which
> > > > > >> > > > > >    turns out to be useful in understanding what robust
> > > > > >> streaming in
> > > > > >> > > SQL
> > > > > >> > > > > > might
> > > > > >> > > > > >    look like. Many of you probably already know some
> or
> > > all
> > > > of
> > > > > >> > what's
> > > > > >> > > > in
> > > > > >> > > > > > here,
> > > > > >> > > > > >    but I felt it was necessary to have it all written
> > down
> > > > in
> > > > > >> order
> > > > > >> > > to
> > > > > >> > > > > > justify
> > > > > >> > > > > >    some of the proposals I wanted to make in the
> second
> > > doc.
> > > > > >> > > > > >
> > > > > >> > > > > >    2. A streaming SQL spec for Calcite - The goal for
> > this
> > > > doc
> > > > > >> is
> > > > > >> > > that
> > > > > >> > > > it
> > > > > >> > > > > >    would become a general specification for what
> robust
> > > > > >> streaming
> > > > > >> > SQL
> > > > > >> > > > in
> > > > > >> > > > > >    Calcite should look like. It would start out as a
> > basic
> > > > > >> proposal
> > > > > >> > > of
> > > > > >> > > > > what
> > > > > >> > > > > >    things *could* look like (combining both what
> things
> > > look
> > > > > >> like
> > > > > >> > now
> > > > > >> > > > as
> > > > > >> > > > > > well
> > > > > >> > > > > >    as a set of proposed changes for the future), and
> we
> > > > could
> > > > > >> all
> > > > > >> > > > iterate
> > > > > >> > > > > > on
> > > > > >> > > > > >    it together until we get to something we're happy
> > with.
> > > > > >> > > > > >
> > > > > >> > > > > > At this point, I have doc #1 ready, and it's a bit of
> a
> > > > > monster,
> > > > > >> > so I
> > > > > >> > > > > > figured I'd share it and let folks hack at it with
> > > comments
> > > > if
> > > > > >> they
> > > > > >> > > > have
> > > > > >> > > > > > any, while I try to get the second doc ready in the
> > > > meantime.
> > > > > As
> > > > > >> > part
> > > > > >> > > > of
> > > > > >> > > > > > getting doc #2 ready, I'll be starting a separate
> thread
> > > to
> > > > > try
> > > > > >> to
> > > > > >> > > > gather
> > > > > >> > > > > > input on what things are already in flight for
> streaming
> > > SQL
> > > > > >> across
> > > > > >> > > the
> > > > > >> > > > > > various communities, to make sure the proposal
> captures
> > > > > >> everything
> > > > > >> > > > that's
> > > > > >> > > > > > going on as accurately as it can.
> > > > > >> > > > > >
> > > > > >> > > > > > If you have any questions or comments, I'm interested
> to
> > > > hear
> > > > > >> them.
> > > > > >> > > > > > Otherwise, here's doc #1, "The Beam Model : Streams &
> > > > Tables":
> > > > > >> > > > > >
> > > > > >> > > > > >   http://s.apache.org/beam-streams-tables
> > > > > >> > > > > >
> > > > > >> > > > > > -Tyler
> > > > > >> > > > > >
> > > > > >> > > > >
> > > > > >> > > >
> > > > > >> > >
> > > > > >> >
> > > > > >>
> > > > > >
> > > > >
> > > >
> > >
> >
>
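Tyler's CoGBK explanation quoted above (flatten + GBK) can be modeled with a
short, self-contained sketch. This is a toy Python illustration of the
semantics only, not Beam code; all names in it are invented for the example:

```python
from collections import defaultdict

def flatten(*streams):
    # Non-grouping merge: tag each element with its input index and
    # interleave all inputs into a single output stream.
    for i, stream in enumerate(streams):
        for key, value in stream:
            yield key, (i, value)

def group_by_key(stream, num_inputs):
    # Grouping: accumulate the merged stream into a single table,
    # keeping a per-input list of values under each key.
    table = defaultdict(lambda: [[] for _ in range(num_inputs)])
    for key, (i, value) in stream:
        table[key][i].append(value)
    return dict(table)

def co_group_by_key(*streams):
    # CoGBK is just flatten followed by GBK on the union stream,
    # yielding what is effectively a single merged table.
    return group_by_key(flatten(*streams), len(streams))

left = [("a", 1), ("b", 2)]
right = [("a", 10)]
print(co_group_by_key(left, right))
```

Grouping, aggregation, and window merging then act on that one merged table,
which is why multiple inputs don't change the stream/table picture much.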

Re: Towards a spec for robust streaming SQL, Part 1

Posted by Tyler Akidau <ta...@google.com.INVALID>.
Being able to support an EMIT config independent of the query itself sounds
great for compatible use cases (which should be many :-).

Shaoxuan, can you please refresh my memory: what does a dynamic table mean
in Flink? It's basically just a state table, right? The "dynamic" part of
the name is simply to imply that it can evolve and change over time?

-Tyler
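The "EMIT config independent of the query" idea under discussion can be
sketched with a toy model: the same grouped aggregation, with the emission
policy (per record vs. at the watermark) supplied as separate configuration
rather than in the query text. This is illustrative Python only; it is not
Beam, Flink, or Calcite API, and every name in it is invented:

```python
from collections import defaultdict

def aggregate(stream, watermark_positions, emit="per_record"):
    table = defaultdict(int)          # the dynamic (state) table
    emitted = []                      # the output stream
    for pos, (key, value) in enumerate(stream):
        table[key] += value           # stream -> table: accumulate
        if emit == "per_record":
            # table -> stream: trigger on every element
            emitted.append((key, table[key]))
        elif emit == "at_watermark" and pos in watermark_positions:
            # table -> stream: trigger only when the watermark passes
            emitted.extend(sorted(table.items()))
    return emitted

events = [("a", 1), ("a", 2), ("b", 5)]
print(aggregate(events, set(), emit="per_record"))
print(aggregate(events, {2}, emit="at_watermark"))
```

The query ("sum values per key") is identical in both runs; only the
external EMIT config changes what the output stream looks like.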


On Fri, May 12, 2017 at 1:59 AM Shaoxuan Wang <ws...@gmail.com> wrote:

> Thanks to Tyler and Fabian for sharing your thoughts.
>
> Regarding the early/late update control in Flink: IMO, each dynamic
> table can have an EMIT config. For the Flink Table API, this can be easily
> implemented in different manners, case by case. For instance, in a window
> aggregate, we could define "when to EMIT a result" via a windowConf for
> each window when we create windows. Unfortunately we do not have such
> flexibility (as compared with the Table API) in a SQL query, so we need to
> find a way to embed this EMIT config.
>
> Regards,
> Shaoxuan
>
>
> On Fri, May 12, 2017 at 4:28 PM, Fabian Hueske <fh...@gmail.com> wrote:
>
> > 2017-05-11 7:14 GMT+02:00 Tyler Akidau <ta...@google.com.invalid>:
> >
> > > On Tue, May 9, 2017 at 3:06 PM Fabian Hueske <fh...@gmail.com>
> wrote:
> > >
> > > > Hi Tyler,
> > > >
> > > > thank you very much for this excellent write-up and the super nice
> > > > visualizations!
> > > > You are discussing a lot of the things that we have been thinking
> about
> > > as
> > > > well from a different perspective.
> > > > IMO, yours and the Flink model are pretty much aligned although we
> use
> > a
> > > > different terminology (which is not yet completely established). So
> > there
> > > > should be room for unification ;-)
> > > >
> > >
> > > Good to hear, thanks for giving it a look. :-)
> > >
> > >
> > > > Allow me a few words on the current state in Flink. In the upcoming
> > 1.3.0
> > > > release, we will have support for group window (TUMBLE, HOP,
> SESSION),
> > > OVER
> > > > RANGE/ROW window (without FOLLOWING), and non-windowed GROUP BY
> > > > aggregations. The group windows are triggered by watermark and the
> over
> > > > window and non-windowed aggregations emit for each input record
> > > > (AtCount(1)). The window aggregations do not support early or late
> > firing
> > > > (late records are dropped), so no updates here. However, the
> > non-windowed
> > > > aggregates produce updates (in acc and acc/retract mode). Based on
> this
> > > we
> > > > will work on better control for late updates and early firing as well
> > as
> > > > joins in the next release.
> > > >
> > >
> > > Impressive. At this rate there's a good chance we'll just be doing
> > catchup
> > > and thanking you for building everything. ;-) Do you have ideas for
> what
> > > you want your early/late updates control to look like? That's one of
> the
> > > areas I'd like to see better defined for us. And how deep are you going
> > > with joins?
> > >
> >
> > Right now (well, actually I merged the change an hour ago) we are using
> > a QueryConfig object to specify state retention intervals, to be able to
> > clean up state for inactive keys.
> > Running a query looks like this:
> >
> > // -------
> > val env = StreamExecutionEnvironment.getExecutionEnvironment
> > val tEnv = TableEnvironment.getTableEnvironment(env)
> >
> > val qConf = tEnv.queryConfig.withIdleStateRetentionTime(Time.hours(12))
> //
> > state of inactive keys is kept for 12 hours
> >
> > val t: Table = tEnv.sql("SELECT a, COUNT(*) AS cnt FROM MyTable GROUP BY
> > a")
> > val stream: DataStream[(Boolean, Row)] = t.toRetractStream[Row](qConf) //
> > boolean flag for acc/retract
> >
> > env.execute()
> > // -------
> >
> > We plan to use the QueryConfig also to specify early/late updates. Our
> > main motivation is to have a uniform, standard SQL for batch and streaming.
> > Hence, we have to move the configuration out of the query. But I agree
> > that it would be very nice to be able to include it in the query. I think
> > it should not be difficult to optionally support an EMIT clause as well.
> >
> >
> > >
> > > > Reading the document, I did not find any major difference in our
> > > concepts.
> > > > In fact, we are aiming to support the cases you are describing as
> well.
> > > > I have a question though. Would you classify an OVER aggregation as a
> > > > stream -> stream or stream -> table operation? It collects records to
> > > > aggregate them, but emits one record for each input row. Depending on
> > the
> > > > window definition (i.e., with FOLLOWING CURRENT ROW), you can compute
> > and
> > > > emit the result record when the input record is received.
> > > >
> > >
> > > I would call it a composite stream → stream operation (since SQL, like
> > the
> > > standard Beam/Flink APIs, is a higher level set of constructs than raw
> > > streams/tables operations) consisting of a stream → table windowed
> > grouping
> > > followed by a table → stream triggering on every element, basically as
> > you
> > > described in the previous paragraph.
> > >
> > >
> > That makes sense. Thanks :-)
> >
> >
> > > -Tyler
> > >
> > >
> > > >
> > > > I'm looking forward to the second part.
> > > >
> > > > Cheers, Fabian
> > > >
> > > >
> > > >
> > > > 2017-05-09 0:34 GMT+02:00 Tyler Akidau <ta...@google.com.invalid>:
> > > >
> > > > > Any thoughts here Fabian? I'm planning to start sending out some
> more
> > > > > emails towards the end of the week.
> > > > >
> > > > > -Tyler
> > > > >
> > > > >
> > > > > On Wed, Apr 26, 2017 at 8:18 AM Tyler Akidau <ta...@google.com>
> > > wrote:
> > > > >
> > > > > > No worries, thanks for the heads up. Good luck wrapping all that
> > > stuff
> > > > > up.
> > > > > >
> > > > > > -Tyler
> > > > > >
> > > > > > On Tue, Apr 25, 2017 at 12:07 AM Fabian Hueske <
> fhueske@gmail.com>
> > > > > wrote:
> > > > > >
> > > > > >> Hi Tyler,
> > > > > >>
> > > > > >> thanks for pushing this effort and including the Flink list.
> > > > > >> I haven't managed to read the doc yet, but just wanted to thank
> > you
> > > > for
> > > > > >> the
> > > > > >> write-up and let you know that I'm very interested in this
> > > discussion.
> > > > > >>
> > > > > >> We are very close to the feature freeze of Flink 1.3 and I'm
> quite
> > > > busy
> > > > > >> getting as many contributions merged before the release is
> forked
> > > off.
> > > > > >> Once that has happened, I'll have more time to read and comment.
> > > > > >>
> > > > > >> Thanks,
> > > > > >> Fabian
> > > > > >>
> > > > > >>
> > > > > >> 2017-04-22 0:16 GMT+02:00 Tyler Akidau
> <takidau@google.com.invalid
> > > >:
> > > > > >>
> > > > > >> > Good point, when you start talking about anything less than a
> > full
> > > > > join,
> > > > > >> > triggers get involved to describe how one actually achieves
> the
> > > > > desired
> > > > > >> > semantics, and they may end up being tied to just one of the
> > > inputs
> > > > > >> (e.g.,
> > > > > >> > you may only care about the watermark for one side of the
> join).
> > > Am
> > > > > >> > expecting us to address these sorts of details more precisely
> in
> > > doc
> > > > > #2.
> > > > > >> >
> > > > > >> > -Tyler
> > > > > >> >
> > > > > >> > On Fri, Apr 21, 2017 at 2:26 PM Kenneth Knowles
> > > > > <klk@google.com.invalid
> > > > > >> >
> > > > > >> > wrote:
> > > > > >> >
> > > > > >> > > There's something to be said about having different
> triggering
> > > > > >> depending
> > > > > >> > on
> > > > > >> > > which side of a join data comes from, perhaps?
> > > > > >> > >
> > > > > >> > > (delightful doc, as usual)
> > > > > >> > >
> > > > > >> > > Kenn
> > > > > >> > >
> > > > > >> > > On Fri, Apr 21, 2017 at 1:33 PM, Tyler Akidau
> > > > > >> <takidau@google.com.invalid
> > > > > >> > >
> > > > > >> > > wrote:
> > > > > >> > >
> > > > > >> > > > Thanks for reading, Luke. The simple answer is that CoGBK
> is
> > > > > >> basically
> > > > > >> > > > flatten + GBK. Flatten is a non-grouping operation that
> > merges
> > > > the
> > > > > >> > input
> > > > > >> > > > streams into a single output stream. GBK then groups the
> > data
> > > > > within
> > > > > >> > that
> > > > > >> > > > single union stream as you might otherwise expect,
> yielding
> > a
> > > > > single
> > > > > >> > > table.
> > > > > >> > > > So I think it doesn't really impact things much. Grouping,
> > > > > >> aggregation,
> > > > > >> > > > window merging etc all just act upon the merged stream and
> > > > > generate
> > > > > >> > what
> > > > > >> > > is
> > > > > >> > > > effectively a merged table.
> > > > > >> > > >
> > > > > >> > > > -Tyler
> > > > > >> > > >
> > > > > >> > > > On Fri, Apr 21, 2017 at 12:36 PM Lukasz Cwik
> > > > > >> <lcwik@google.com.invalid
> > > > > >> > >
> > > > > >> > > > wrote:
> > > > > >> > > >
> > > > > >> > > > > The doc is a good read.
> > > > > >> > > > >
> > > > > >> > > > > I think you do a great job of explaining table ->
> stream,
> > > > stream
> > > > > >> ->
> > > > > >> > > > stream,
> > > > > >> > > > > and stream -> table when there is only one stream.
> > > > > >> > > > > But when there are multiple streams reading/writing to a
> > > > table,
> > > > > >> how
> > > > > >> > > does
> > > > > >> > > > > that impact what occurs?
> > > > > >> > > > > For example, with CoGBK you have multiple streams
> writing
> > > to a
> > > > > >> table,
> > > > > >> > > how
> > > > > >> > > > > does that impact window merging?
> > > > > >> > > > >
> > > > > >> > > > > On Thu, Apr 20, 2017 at 5:57 PM, Tyler Akidau
> > > > > >> > > <takidau@google.com.invalid
> > > > > >> > > > >
> > > > > >> > > > > wrote:
> > > > > >> > > > >
> > > > > >> > > > > > Hello Beam, Calcite, and Flink dev lists!
> > > > > >> > > > > >
> > > > > >> > > > > > Apologies for the big cross post, but I thought this
> > might
> > > > be
> > > > > >> > > something
> > > > > >> > > > > all
> > > > > >> > > > > > three communities would find relevant.
> > > > > >> > > > > >
> > > > > >> > > > > > Beam is finally making progress on a SQL DSL utilizing
> > > > > Calcite,
> > > > > >> > > thanks
> > > > > >> > > > to
> > > > > >> > > > > > Mingmin Xu. As you can imagine, we need to come to
> some
> > > > > >> conclusion
> > > > > >> > > > about
> > > > > >> > > > > > how to elegantly support the full suite of streaming
> > > > > >> functionality
> > > > > >> > in
> > > > > >> > > > the
> > > > > >> > > > > > Beam model via Calcite SQL. You folks in the Flink
> > > > > community
> > > > > >> > have
> > > > > >> > > > been
> > > > > >> > > > > > pushing on this (e.g., adding windowing constructs,
> > > amongst
> > > > > >> others,
> > > > > >> > > > thank
> > > > > >> > > > > > you! :-), but from my understanding we still don't
> have
> > a
> > > > full
> > > > > >> spec
> > > > > >> > > for
> > > > > >> > > > > how
> > > > > >> > > > > > to support robust streaming in SQL (including but not
> > > > limited
> > > > > >> to,
> > > > > >> > > > e.g., a
> > > > > >> > > > > > triggers analogue such as EMIT).
> > > > > >> > > > > >
> > > > > >> > > > > > I've been spending a lot of time thinking about this
> and
> > > > have
> > > > > >> some
> > > > > >> > > > > opinions
> > > > > >> > > > > > about how I think it should look that I've already
> > written
> > > > > down,
> > > > > >> > so I
> > > > > >> > > > > > volunteered to try to drive forward agreement on a
> > general
> > > > > >> > streaming
> > > > > >> > > > SQL
> > > > > >> > > > > > spec between our three communities (well, technically
> I
> > > > > >> volunteered
> > > > > >> > > to
> > > > > >> > > > do
> > > > > >> > > > > > that w/ Beam and Calcite, but I figured you Flink
> folks
> > > > might
> > > > > >> want
> > > > > >> > to
> > > > > >> > > > > join
> > > > > >> > > > > > in since you're going that direction already anyway
> and
> > > will
> > > > > >> have
> > > > > >> > > > useful
> > > > > >> > > > > > insights :-).
> > > > > >> > > > > >
> > > > > >> > > > > > My plan was to do this by sharing two docs:
> > > > > >> > > > > >
> > > > > >> > > > > >    1. The Beam Model : Streams & Tables - This one is
> > for
> > > > > >> context,
> > > > > >> > > and
> > > > > >> > > > > >    really only mentions SQL in passing. But it
> describes
> > > the
> > > > > >> > > > relationship
> > > > > >> > > > > >    between the Beam Model and the "streams & tables"
> way
> > > of
> > > > > >> > thinking,
> > > > > >> > > > > which
> > > > > >> > > > > >    turns out to be useful in understanding what robust
> > > > > >> streaming in
> > > > > >> > > SQL
> > > > > >> > > > > > might
> > > > > >> > > > > >    look like. Many of you probably already know some
> or
> > > all
> > > > of
> > > > > >> > what's
> > > > > >> > > > in
> > > > > >> > > > > > here,
> > > > > >> > > > > >    but I felt it was necessary to have it all written
> > down
> > > > in
> > > > > >> order
> > > > > >> > > to
> > > > > >> > > > > > justify
> > > > > >> > > > > >    some of the proposals I wanted to make in the
> second
> > > doc.
> > > > > >> > > > > >
> > > > > >> > > > > >    2. A streaming SQL spec for Calcite - The goal for
> > this
> > > > doc
> > > > > >> is
> > > > > >> > > that
> > > > > >> > > > it
> > > > > >> > > > > >    would become a general specification for what
> robust
> > > > > >> streaming
> > > > > >> > SQL
> > > > > >> > > > in
> > > > > >> > > > > >    Calcite should look like. It would start out as a
> > basic
> > > > > >> proposal
> > > > > >> > > of
> > > > > >> > > > > what
> > > > > >> > > > > >    things *could* look like (combining both what
> things
> > > look
> > > > > >> like
> > > > > >> > now
> > > > > >> > > > as
> > > > > >> > > > > > well
> > > > > >> > > > > >    as a set of proposed changes for the future), and
> we
> > > > could
> > > > > >> all
> > > > > >> > > > iterate
> > > > > >> > > > > > on
> > > > > >> > > > > >    it together until we get to something we're happy
> > with.
> > > > > >> > > > > >
> > > > > >> > > > > > At this point, I have doc #1 ready, and it's a bit of
> a
> > > > > monster,
> > > > > >> > so I
> > > > > >> > > > > > figured I'd share it and let folks hack at it with
> > > comments
> > > > if
> > > > > >> they
> > > > > >> > > > have
> > > > > >> > > > > > any, while I try to get the second doc ready in the
> > > > meantime.
> > > > > As
> > > > > >> > part
> > > > > >> > > > of
> > > > > >> > > > > > getting doc #2 ready, I'll be starting a separate
> thread
> > > to
> > > > > try
> > > > > >> to
> > > > > >> > > > gather
> > > > > >> > > > > > input on what things are already in flight for
> streaming
> > > SQL
> > > > > >> across
> > > > > >> > > the
> > > > > >> > > > > > various communities, to make sure the proposal
> captures
> > > > > >> everything
> > > > > >> > > > that's
> > > > > >> > > > > > going on as accurately as it can.
> > > > > >> > > > > >
> > > > > >> > > > > > If you have any questions or comments, I'm interested
> to
> > > > hear
> > > > > >> them.
> > > > > >> > > > > > Otherwise, here's doc #1, "The Beam Model : Streams &
> > > > Tables":
> > > > > >> > > > > >
> > > > > >> > > > > >   http://s.apache.org/beam-streams-tables
> > > > > >> > > > > >
> > > > > >> > > > > > -Tyler
> > > > > >> > > > > >
> > > > > >> > > > >
> > > > > >> > > >
> > > > > >> > >
> > > > > >> >
> > > > > >>
> > > > > >
> > > > >
> > > >
> > >
> >
>
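Tyler's answer to Fabian in the thread above, that an OVER aggregation is a
composite stream → stream operation (a stream → table windowed grouping
followed by a table → stream trigger firing on every element), can be
sketched as a toy running aggregate. This is illustrative Python only, not
engine code, and the function name is invented for the example:

```python
from collections import defaultdict

def over_running_sum(stream):
    # Models: SUM(value) OVER (PARTITION BY key
    #                          ROWS UNBOUNDED PRECEDING AND CURRENT ROW)
    table = defaultdict(int)              # stream -> table: per-key state
    for key, value in stream:
        table[key] += value
        # table -> stream: trigger on every element, so one output
        # row is emitted per input row.
        yield key, value, table[key]

rows = [("a", 1), ("b", 2), ("a", 3)]
print(list(over_running_sum(rows)))
```

The grouping and the per-element trigger are both present; they are just
fused into one operator from SQL's point of view.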

> > > > > >> streaming in
> > > > > >> > > SQL
> > > > > >> > > > > > might
> > > > > >> > > > > >    look like. Many of you probably already know some
> or
> > > all
> > > > of
> > > > > >> > what's
> > > > > >> > > > in
> > > > > >> > > > > > here,
> > > > > >> > > > > >    but I felt it was necessary to have it all written
> > down
> > > > in
> > > > > >> order
> > > > > >> > > to
> > > > > >> > > > > > justify
> > > > > >> > > > > >    some of the proposals I wanted to make in the
> second
> > > doc.
> > > > > >> > > > > >
> > > > > >> > > > > >    2. A streaming SQL spec for Calcite - The goal for
> > this
> > > > doc
> > > > > >> is
> > > > > >> > > that
> > > > > >> > > > it
> > > > > >> > > > > >    would become a general specification for what
> robust
> > > > > >> streaming
> > > > > >> > SQL
> > > > > >> > > > in
> > > > > >> > > > > >    Calcite should look like. It would start out as a
> > basic
> > > > > >> proposal
> > > > > >> > > of
> > > > > >> > > > > what
> > > > > >> > > > > >    things *could* look like (combining both what
> things
> > > look
> > > > > >> like
> > > > > >> > now
> > > > > >> > > > as
> > > > > >> > > > > > well
> > > > > >> > > > > >    as a set of proposed changes for the future), and
> we
> > > > could
> > > > > >> all
> > > > > >> > > > iterate
> > > > > >> > > > > > on
> > > > > >> > > > > >    it together until we get to something we're happy
> > with.
> > > > > >> > > > > >
> > > > > >> > > > > > At this point, I have doc #1 ready, and it's a bit of
> a
> > > > > monster,
> > > > > >> > so I
> > > > > >> > > > > > figured I'd share it and let folks hack at it with
> > > comments
> > > > if
> > > > > >> they
> > > > > >> > > > have
> > > > > >> > > > > > any, while I try to get the second doc ready in the
> > > > meantime.
> > > > > As
> > > > > >> > part
> > > > > >> > > > of
> > > > > >> > > > > > getting doc #2 ready, I'll be starting a separate
> thread
> > > to
> > > > > try
> > > > > >> to
> > > > > >> > > > gather
> > > > > >> > > > > > input on what things are already in flight for
> streaming
> > > SQL
> > > > > >> across
> > > > > >> > > the
> > > > > >> > > > > > various communities, to make sure the proposal
> captures
> > > > > >> everything
> > > > > >> > > > that's
> > > > > >> > > > > > going on as accurately as it can.
> > > > > >> > > > > >
> > > > > >> > > > > > If you have any questions or comments, I'm interested
> to
> > > > hear
> > > > > >> them.
> > > > > >> > > > > > Otherwise, here's doc #1, "The Beam Model : Streams &
> > > > Tables":
> > > > > >> > > > > >
> > > > > >> > > > > >   http://s.apache.org/beam-streams-tables
> > > > > >> > > > > >
> > > > > >> > > > > > -Tyler
> > > > > >> > > > > >
> > > > > >> > > > >
> > > > > >> > > >
> > > > > >> > >
> > > > > >> >
> > > > > >>
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: Towards a spec for robust streaming SQL, Part 1

Posted by Shaoxuan Wang <ws...@gmail.com>.
Thanks to Tyler and Fabian for sharing your thoughts.

Regarding the early/late update control in Flink: IMO, each dynamic
table can have an EMIT config. In the Flink Table API, this can be easily
implemented in different manners, case by case. For instance, in a window
aggregate, we could define when to EMIT a result via a windowConf for each
window when we create the windows. Unfortunately, we do not have such
flexibility in a SQL query (as compared with the Table API), so we need to
find a way to embed this EMIT config.

Regards,
Shaoxuan


On Fri, May 12, 2017 at 4:28 PM, Fabian Hueske <fh...@gmail.com> wrote:

> 2017-05-11 7:14 GMT+02:00 Tyler Akidau <ta...@google.com.invalid>:
>
> > On Tue, May 9, 2017 at 3:06 PM Fabian Hueske <fh...@gmail.com> wrote:
> >
> > > Hi Tyler,
> > >
> > > thank you very much for this excellent write-up and the super nice
> > > visualizations!
> > > You are discussing a lot of the things that we have been thinking about
> > as
> > > well from a different perspective.
> > > IMO, yours and the Flink model are pretty much aligned although we use
> a
> > > different terminology (which is not yet completely established). So
> there
> > > should be room for unification ;-)
> > >
> >
> > Good to hear, thanks for giving it a look. :-)
> >
> >
> > > Allow me a few words on the current state in Flink. In the upcoming
> 1.3.0
> > > release, we will have support for group window (TUMBLE, HOP, SESSION),
> > OVER
> > > RANGE/ROW window (without FOLLOWING), and non-windowed GROUP BY
> > > aggregations. The group windows are triggered by watermark and the over
> > > window and non-windowed aggregations emit for each input record
> > > (AtCount(1)). The window aggregations do not support early or late
> firing
> > > (late records are dropped), so no updates here. However, the
> non-windowed
> > > aggregates produce updates (in acc and acc/retract mode). Based on this
> > we
> > > will work on better control for late updates and early firing as well
> as
> > > joins in the next release.
> > >
> >
> > Impressive. At this rate there's a good chance we'll just be doing
> catchup
> > and thanking you for building everything. ;-) Do you have ideas for what
> > you want your early/late updates control to look like? That's one of the
> > areas I'd like to see better defined for us. And how deep are you going
> > with joins?
> >
>
> Right now (well actually I merged the change 1h ago) we are using a
> QueryConfig object to specify state retention intervals to be able to clean
> up state for inactive keys.
> Running a query looks like this:
>
> // -------
> val env = StreamExecutionEnvironment.getExecutionEnvironment
> val tEnv = TableEnvironment.getTableEnvironment(env)
>
> val qConf = tEnv.queryConfig.withIdleStateRetentionTime(Time.hours(12)) //
> state of inactive keys is kept for 12 hours
>
> val t: Table = tEnv.sql("SELECT a, COUNT(*) AS cnt FROM MyTable GROUP BY
> a")
> val stream: DataStream[(Boolean, Row)] = t.toRetractStream[Row](qConf) //
> boolean flag for acc/retract
>
> env.execute()
> // -------
>
> We plan to use the QueryConfig also to specify early/late updates. Our main
> motivation is to have a uniform and standard SQL for batch and streaming.
> Hence, we have to move the configuration out of the query. But I agree,
> that it would be very nice to be able to include it in the query. I think
> it should not be difficult to optionally support an EMIT clause as well.
>
>
> >
> > > Reading the document, I did not find any major difference in our
> > concepts.
> > > In fact, we are aiming to support the cases you are describing as well.
> > > I have a question though. Would you classify an OVER aggregation as a
> > > stream -> stream or stream -> table operation? It collects records to
> > > aggregate them, but emits one record for each input row. Depending on
> the
> > > window definition (i.e., with FOLLOWING CURRENT ROW), you can compute
> and
> > > emit the result record when the input record is received.
> > >
> >
> > I would call it a composite stream → stream operation (since SQL, like
> the
> > standard Beam/Flink APIs, is a higher level set of constructs than raw
> > streams/tables operations) consisting of a stream → table windowed
> grouping
> > followed by a table → stream triggering on every element, basically as
> you
> > described in the previous paragraph.
> >
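That composite reading of OVER can be mimicked with a toy running count
(illustrative Java only, not Flink or Beam code): each input element
updates a keyed table and immediately triggers an output row, i.e. the
AtCount(1) triggering mentioned above.

```java
import java.util.*;

public class OverAggregateSketch {
    // Illustrative only: an OVER-style running count modeled as
    // stream -> table (accumulate per key) composed with
    // table -> stream (trigger an output row for every input element).
    static List<String> runningCount(List<String> keys) {
        Map<String, Integer> table = new HashMap<>();   // the evolving table
        List<String> output = new ArrayList<>();        // the triggered stream
        for (String k : keys) {
            int cnt = table.merge(k, 1, Integer::sum);  // stream -> table
            output.add(k + ":" + cnt);                  // table -> stream, per element
        }
        return output;
    }

    public static void main(String[] args) {
        System.out.println(runningCount(List.of("a", "b", "a")));
        // prints [a:1, b:1, a:2]
    }
}
```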
> >
> That makes sense. Thanks :-)
>
>
> > -Tyler
> >
> >
> > >
> > > I'm looking forward to the second part.
> > >
> > > Cheers, Fabian
> > >
> > >
> > >
> > > 2017-05-09 0:34 GMT+02:00 Tyler Akidau <ta...@google.com.invalid>:
> > >
> > > > Any thoughts here Fabian? I'm planning to start sending out some more
> > > > emails towards the end of the week.
> > > >
> > > > -Tyler
> > > >
> > > >
> > > > On Wed, Apr 26, 2017 at 8:18 AM Tyler Akidau <ta...@google.com>
> > wrote:
> > > >
> > > > > No worries, thanks for the heads up. Good luck wrapping all that
> > stuff
> > > > up.
> > > > >
> > > > > -Tyler
> > > > >
> > > > > On Tue, Apr 25, 2017 at 12:07 AM Fabian Hueske <fh...@gmail.com>
> > > > wrote:
> > > > >
> > > > >> Hi Tyler,
> > > > >>
> > > > >> thanks for pushing this effort and including the Flink list.
> > > > >> I haven't managed to read the doc yet, but just wanted to thank
> you
> > > for
> > > > >> the
> > > > >> write-up and let you know that I'm very interested in this
> > discussion.
> > > > >>
> > > > >> We are very close to the feature freeze of Flink 1.3 and I'm quite
> > > busy
> > > > >> getting as many contributions merged before the release is forked
> > off.
> > > > >> When that has happened, I'll have more time to read and comment.
> > > > >>
> > > > >> Thanks,
> > > > >> Fabian
> > > > >>
> > > > >>
> > > > >> 2017-04-22 0:16 GMT+02:00 Tyler Akidau <takidau@google.com.invalid
> > >:
> > > > >>
> > > > >> > Good point, when you start talking about anything less than a
> full
> > > > join,
> > > > >> > triggers get involved to describe how one actually achieves the
> > > > desired
> > > > >> > semantics, and they may end up being tied to just one of the
> > inputs
> > > > >> (e.g.,
> > > > >> > you may only care about the watermark for one side of the join).
> > Am
> > > > >> > expecting us to address these sorts of details more precisely in
> > doc
> > > > #2.
> > > > >> >
> > > > >> > -Tyler
> > > > >> >
> > > > >> > On Fri, Apr 21, 2017 at 2:26 PM Kenneth Knowles
> > > > <klk@google.com.invalid
> > > > >> >
> > > > >> > wrote:
> > > > >> >
> > > > >> > > There's something to be said about having different triggering
> > > > >> depending
> > > > >> > on
> > > > >> > > which side of a join data comes from, perhaps?
> > > > >> > >
> > > > >> > > (delightful doc, as usual)
> > > > >> > >
> > > > >> > > Kenn
> > > > >> > >
> > > > >> > > On Fri, Apr 21, 2017 at 1:33 PM, Tyler Akidau
> > > > >> <takidau@google.com.invalid
> > > > >> > >
> > > > >> > > wrote:
> > > > >> > >
> > > > >> > > > Thanks for reading, Luke. The simple answer is that CoGBK is
> > > > >> basically
> > > > >> > > > flatten + GBK. Flatten is a non-grouping operation that
> merges
> > > the
> > > > >> > input
> > > > >> > > > streams into a single output stream. GBK then groups the
> data
> > > > within
> > > > >> > that
> > > > >> > > > single union stream as you might otherwise expect, yielding
> a
> > > > single
> > > > >> > > table.
> > > > >> > > > So I think it doesn't really impact things much. Grouping,
> > > > >> aggregation,
> > > > >> > > > window merging etc all just act upon the merged stream and
> > > > generate
> > > > >> > what
> > > > >> > > is
> > > > >> > > > effectively a merged table.
> > > > >> > > >
> > > > >> > > > -Tyler
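The "flatten + GBK" decomposition quoted above can be sketched with plain
Java collections, as a toy model only (ordinary Lists and Maps standing in
for streams and tables; none of this is Beam's actual API):

```java
import java.util.*;
import java.util.stream.*;

public class CoGbkSketch {
    // Illustrative only: models CoGBK as flatten (concatenate the input
    // streams) followed by GBK (group the merged stream by key).
    static Map<String, List<Integer>> coGbk(List<Map.Entry<String, Integer>> left,
                                            List<Map.Entry<String, Integer>> right) {
        // Flatten: a non-grouping merge of the two input streams.
        List<Map.Entry<String, Integer>> merged = new ArrayList<>(left);
        merged.addAll(right);
        // GBK: group the single union stream by key, yielding one "table".
        return merged.stream().collect(Collectors.groupingBy(
                Map.Entry::getKey,
                Collectors.mapping(Map.Entry::getValue, Collectors.toList())));
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> clicks =
                List.of(Map.entry("a", 1), Map.entry("b", 2));
        List<Map.Entry<String, Integer>> views =
                List.of(Map.entry("a", 3));
        System.out.println(coGbk(clicks, views)); // {a=[1, 3], b=[2]}, key order unspecified
    }
}
```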
> > > > >> > > >
> > > > >> > > > On Fri, Apr 21, 2017 at 12:36 PM Lukasz Cwik
> > > > >> <lcwik@google.com.invalid
> > > > >> > >
> > > > >> > > > wrote:
> > > > >> > > >
> > > > >> > > > > The doc is a good read.
> > > > >> > > > >
> > > > >> > > > > I think you do a great job of explaining table -> stream,
> > > stream
> > > > >> ->
> > > > >> > > > stream,
> > > > >> > > > > and stream -> table when there is only one stream.
> > > > >> > > > > But when there are multiple streams reading/writing to a
> > > table,
> > > > >> how
> > > > >> > > does
> > > > >> > > > > that impact what occurs?
> > > > >> > > > > For example, with CoGBK you have multiple streams writing
> > to a
> > > > >> table,
> > > > >> > > how
> > > > >> > > > > does that impact window merging?
> > > > >> > > > >
> > > > >> > > > > On Thu, Apr 20, 2017 at 5:57 PM, Tyler Akidau
> > > > >> > > <takidau@google.com.invalid
> > > > >> > > > >
> > > > >> > > > > wrote:
> > > > >> > > > >
> > > > >> > > > > > Hello Beam, Calcite, and Flink dev lists!
> > > > >> > > > > >
> > > > >> > > > > > Apologies for the big cross post, but I thought this
> might
> > > be
> > > > >> > > something
> > > > >> > > > > all
> > > > >> > > > > > three communities would find relevant.
> > > > >> > > > > >
> > > > >> > > > > > Beam is finally making progress on a SQL DSL utilizing
> > > > Calcite,
> > > > >> > > thanks
> > > > >> > > > to
> > > > >> > > > > > Mingmin Xu. As you can imagine, we need to come to some
> > > > >> conclusion
> > > > >> > > > about
> > > > >> > > > > > how to elegantly support the full suite of streaming
> > > > >> functionality
> > > > >> > in
> > > > >> > > > the
> > > > >> > > > > > Beam model via Calcite SQL. You folks in the Flink
> > > > community
> > > > >> > have
> > > > >> > > > been
> > > > >> > > > > > pushing on this (e.g., adding windowing constructs,
> > amongst
> > > > >> others,
> > > > >> > > > thank
> > > > >> > > > > > you! :-), but from my understanding we still don't have
> a
> > > full
> > > > >> spec
> > > > >> > > for
> > > > >> > > > > how
> > > > >> > > > > > to support robust streaming in SQL (including but not
> > > limited
> > > > >> to,
> > > > >> > > > e.g., a
> > > > >> > > > > > triggers analogue such as EMIT).
> > > > >> > > > > >
> > > > >> > > > > > I've been spending a lot of time thinking about this and
> > > have
> > > > >> some
> > > > >> > > > > opinions
> > > > >> > > > > > about how I think it should look that I've already
> written
> > > > down,
> > > > >> > so I
> > > > >> > > > > > volunteered to try to drive forward agreement on a
> general
> > > > >> > streaming
> > > > >> > > > SQL
> > > > >> > > > > > spec between our three communities (well, technically I
> > > > >> volunteered
> > > > >> > > to
> > > > >> > > > do
> > > > >> > > > > > that w/ Beam and Calcite, but I figured you Flink folks
> > > might
> > > > >> want
> > > > >> > to
> > > > >> > > > > join
> > > > >> > > > > > in since you're going that direction already anyway and
> > will
> > > > >> have
> > > > >> > > > useful
> > > > >> > > > > > insights :-).
> > > > >> > > > > >
> > > > >> > > > > > My plan was to do this by sharing two docs:
> > > > >> > > > > >
> > > > >> > > > > >    1. The Beam Model : Streams & Tables - This one is
> for
> > > > >> context,
> > > > >> > > and
> > > > >> > > > > >    really only mentions SQL in passing. But it describes
> > the
> > > > >> > > > relationship
> > > > >> > > > > >    between the Beam Model and the "streams & tables" way
> > of
> > > > >> > thinking,
> > > > >> > > > > which
> > > > >> > > > > >    turns out to be useful in understanding what robust
> > > > >> streaming in
> > > > >> > > SQL
> > > > >> > > > > > might
> > > > >> > > > > >    look like. Many of you probably already know some or
> > all
> > > of
> > > > >> > what's
> > > > >> > > > in
> > > > >> > > > > > here,
> > > > >> > > > > >    but I felt it was necessary to have it all written
> down
> > > in
> > > > >> order
> > > > >> > > to
> > > > >> > > > > > justify
> > > > >> > > > > >    some of the proposals I wanted to make in the second
> > doc.
> > > > >> > > > > >
> > > > >> > > > > >    2. A streaming SQL spec for Calcite - The goal for
> this
> > > doc
> > > > >> is
> > > > >> > > that
> > > > >> > > > it
> > > > >> > > > > >    would become a general specification for what robust
> > > > >> streaming
> > > > >> > SQL
> > > > >> > > > in
> > > > >> > > > > >    Calcite should look like. It would start out as a
> basic
> > > > >> proposal
> > > > >> > > of
> > > > >> > > > > what
> > > > >> > > > > >    things *could* look like (combining both what things
> > look
> > > > >> like
> > > > >> > now
> > > > >> > > > as
> > > > >> > > > > > well
> > > > >> > > > > >    as a set of proposed changes for the future), and we
> > > could
> > > > >> all
> > > > >> > > > iterate
> > > > >> > > > > > on
> > > > >> > > > > >    it together until we get to something we're happy
> with.
> > > > >> > > > > >
> > > > >> > > > > > At this point, I have doc #1 ready, and it's a bit of a
> > > > monster,
> > > > >> > so I
> > > > >> > > > > > figured I'd share it and let folks hack at it with
> > comments
> > > if
> > > > >> they
> > > > >> > > > have
> > > > >> > > > > > any, while I try to get the second doc ready in the
> > > meantime.
> > > > As
> > > > >> > part
> > > > >> > > > of
> > > > >> > > > > > getting doc #2 ready, I'll be starting a separate thread
> > to
> > > > try
> > > > >> to
> > > > >> > > > gather
> > > > >> > > > > > input on what things are already in flight for streaming
> > SQL
> > > > >> across
> > > > >> > > the
> > > > >> > > > > > various communities, to make sure the proposal captures
> > > > >> everything
> > > > >> > > > that's
> > > > >> > > > > > going on as accurately as it can.
> > > > >> > > > > >
> > > > >> > > > > > If you have any questions or comments, I'm interested to
> > > hear
> > > > >> them.
> > > > >> > > > > > Otherwise, here's doc #1, "The Beam Model : Streams &
> > > Tables":
> > > > >> > > > > >
> > > > >> > > > > >   http://s.apache.org/beam-streams-tables
> > > > >> > > > > >
> > > > >> > > > > > -Tyler
> > > > >> > > > > >
> > > > >> > > > >
> > > > >> > > >
> > > > >> > >
> > > > >> >
> > > > >>
> > > > >
> > > >
> > >
> >
>

Re: Towards a spec for robust streaming SQL, Part 1

Posted by Shaoxuan Wang <ws...@gmail.com>.
Thanks to Tyler and Fabian for sharing your thoughts.

Regarding to the early/late update control of FLINK. IMO, each dynamic
table can have an EMIT config. For FLINK table-API, this can be easily
implemented in different manners, case by case. For instance, in window
aggregate, we could define "when EMIT a result" via a windowConf per each
window when we create windows. Unfortunately we do not have such
flexibility (as compared with TableAPI) in SQL query, we need find a way to
embed this EMIT config.

Regards,
Shaoxuan


On Fri, May 12, 2017 at 4:28 PM, Fabian Hueske <fh...@gmail.com> wrote:

> 2017-05-11 7:14 GMT+02:00 Tyler Akidau <ta...@google.com.invalid>:
>
> > On Tue, May 9, 2017 at 3:06 PM Fabian Hueske <fh...@gmail.com> wrote:
> >
> > > Hi Tyler,
> > >
> > > thank you very much for this excellent write-up and the super nice
> > > visualizations!
> > > You are discussing a lot of the things that we have been thinking about
> > as
> > > well from a different perspective.
> > > IMO, yours and the Flink model are pretty much aligned although we use
> a
> > > different terminology (which is not yet completely established). So
> there
> > > should be room for unification ;-)
> > >
> >
> > Good to hear, thanks for giving it a look. :-)
> >
> >
> > > Allow me a few words on the current state in Flink. In the upcoming
> 1.3.0
> > > release, we will have support for group window (TUMBLE, HOP, SESSION),
> > OVER
> > > RANGE/ROW window (without FOLLOWING), and non-windowed GROUP BY
> > > aggregations. The group windows are triggered by watermark and the over
> > > window and non-windowed aggregations emit for each input record
> > > (AtCount(1)). The window aggregations do not support early or late
> firing
> > > (late records are dropped), so no updates here. However, the
> non-windowed
> > > aggregates produce updates (in acc and acc/retract mode). Based on this
> > we
> > > will work on better control for late updates and early firing as well
> as
> > > joins in the next release.
> > >
> >
> > Impressive. At this rate there's a good chance we'll just be doing
> catchup
> > and thanking you for building everything. ;-) Do you have ideas for what
> > you want your early/late updates control to look like? That's one of the
> > areas I'd like to see better defined for us. And how deep are you going
> > with joins?
> >
>
> Right now (well actually I merged the change 1h ago) we are using a
> QueryConfig object to specify state retention intervals to be able to clean
> up state for inactive keys.
> A running a query looks like this:
>
> // -------
> val env = StreamExecutionEnvironment.getExecutionEnvironment
> val tEnv = TableEnvironment.getTableEnvironment(env)
>
> val qConf = tEnv.queryConfig.withIdleStateRetentionTime(Time.hours(12)) //
> state of inactive keys is kept for 12 hours
>
> val t: Table = tEnv.sql("SELECT a, COUNT(*) AS cnt FROM MyTable GROUP BY
> a")
> val stream: DataStream[(Boolean, Row)] = t.toRetractStream[Row](qConf) //
> boolean flag for acc/retract
>
> env.execute()
> // -------
>
> We plan to use the QueryConfig also to specify early/late updates. Our main
> motivation is to have a uniform and standard SQL for batch and streaming.
> Hence, we have to move the configuration out of the query. But I agree,
> that it would be very nice to be able to include it in the query. I think
> it should not be difficult to optionally support an EMIT clause as well.
>
>
> >
> > > Reading the document, I did not find any major difference in our
> > concepts.
> > > In fact, we are aiming to support the cases you are describing as well.
> > > I have a question though. Would you classify an OVER aggregation as a
> > > stream -> stream or stream -> table operation? It collects records to
> > > aggregate them, but emits one record for each input row. Depending on
> the
> > > window definition (i.e., with FOLLOWING CURRENT ROW), you can compute
> and
> > > emit the result record when the input record is received.
> > >
> >
> > I would call it a composite stream → stream operation (since SQL, like
> the
> > standard Beam/Flink APIs, is a higher level set of constructs than raw
> > streams/tables operations) consisting of a stream → table windowed
> grouping
> > followed by a table → stream triggering on every element, basically as
> you
> > described in the previous paragraph.
> >
> >
> That makes sense. Thanks :-)
>
>
> > -Tyler
> >
> >
> > >
> > > I'm looking forward to the second part.
> > >
> > > Cheers, Fabian
> > >
> > >
> > >
> > > 2017-05-09 0:34 GMT+02:00 Tyler Akidau <ta...@google.com.invalid>:
> > >
> > > > Any thoughts here Fabian? I'm planning to start sending out some more
> > > > emails towards the end of the week.
> > > >
> > > > -Tyler
> > > >
> > > >
> > > > On Wed, Apr 26, 2017 at 8:18 AM Tyler Akidau <ta...@google.com>
> > wrote:
> > > >
> > > > > No worries, thanks for the heads up. Good luck wrapping all that
> > stuff
> > > > up.
> > > > >
> > > > > -Tyler
> > > > >
> > > > > On Tue, Apr 25, 2017 at 12:07 AM Fabian Hueske <fh...@gmail.com>
> > > > wrote:
> > > > >
> > > > >> Hi Tyler,
> > > > >>
> > > > >> thanks for pushing this effort and including the Flink list.
> > > > >> I haven't managed to read the doc yet, but just wanted to thank
> you
> > > for
> > > > >> the
> > > > >> write-up and let you know that I'm very interested in this
> > discussion.
> > > > >>
> > > > >> We are very close to the feature freeze of Flink 1.3 and I'm quite
> > > busy
> > > > >> getting as many contributions merged before the release is forked
> > off.
> > > > >> When that happened, I'll have more time to read and comment.
> > > > >>
> > > > >> Thanks,
> > > > >> Fabian
> > > > >>
> > > > >>
> > > > >> 2017-04-22 0:16 GMT+02:00 Tyler Akidau <takidau@google.com.invalid
> > >:
> > > > >>
> > > > >> > Good point, when you start talking about anything less than a
> full
> > > > join,
> > > > >> > triggers get involved to describe how one actually achieves the
> > > > desired
> > > > >> > semantics, and they may end up being tied to just one of the
> > inputs
> > > > >> (e.g.,
> > > > >> > you may only care about the watermark for one side of the join).
> > Am
> > > > >> > expecting us to address these sorts of details more precisely in
> > doc
> > > > #2.
> > > > >> >
> > > > >> > -Tyler
> > > > >> >
> > > > >> > On Fri, Apr 21, 2017 at 2:26 PM Kenneth Knowles
> > > > <klk@google.com.invalid
> > > > >> >
> > > > >> > wrote:
> > > > >> >
> > > > >> > > There's something to be said about having different triggering
> > > > >> depending
> > > > >> > on
> > > > >> > > which side of a join data comes from, perhaps?
> > > > >> > >
> > > > >> > > (delightful doc, as usual)
> > > > >> > >
> > > > >> > > Kenn
> > > > >> > >
> > > > >> > > On Fri, Apr 21, 2017 at 1:33 PM, Tyler Akidau
> > > > >> <takidau@google.com.invalid
> > > > >> > >
> > > > >> > > wrote:
> > > > >> > >
> > > > >> > > > Thanks for reading, Luke. The simple answer is that CoGBK is
> > > > >> basically
> > > > >> > > > flatten + GBK. Flatten is a non-grouping operation that
> merges
> > > the
> > > > >> > input
> > > > >> > > > streams into a single output stream. GBK then groups the
> data
> > > > within
> > > > >> > that
> > > > >> > > > single union stream as you might otherwise expect, yielding
> a
> > > > single
> > > > >> > > table.
> > > > >> > > > So I think it doesn't really impact things much. Grouping,
> > > > >> aggregation,
> > > > >> > > > window merging etc all just act upon the merged stream and
> > > > generate
> > > > >> > what
> > > > >> > > is
> > > > >> > > > effectively a merged table.
> > > > >> > > >
> > > > >> > > > -Tyler
> > > > >> > > >
> > > > >> > > > On Fri, Apr 21, 2017 at 12:36 PM Lukasz Cwik
> > > > >> <lcwik@google.com.invalid
> > > > >> > >
> > > > >> > > > wrote:
> > > > >> > > >
> > > > >> > > > > The doc is a good read.
> > > > >> > > > >
> > > > >> > > > > I think you do a great job of explaining table -> stream,
> > > stream
> > > > >> ->
> > > > >> > > > stream,
> > > > >> > > > > and stream -> table when there is only one stream.
> > > > >> > > > > But when there are multiple streams reading/writing to a
> > > table,
> > > > >> how
> > > > >> > > does
> > > > >> > > > > that impact what occurs?
> > > > >> > > > > For example, with CoGBK you have multiple streams writing
> > to a
> > > > >> table,
> > > > >> > > how
> > > > >> > > > > does that impact window merging?
> > > > >> > > > >
> > > > >> > > > > On Thu, Apr 20, 2017 at 5:57 PM, Tyler Akidau
> > > > >> > > <takidau@google.com.invalid
> > > > >> > > > >
> > > > >> > > > > wrote:
> > > > >> > > > >
> > > > >> > > > > > Hello Beam, Calcite, and Flink dev lists!
> > > > >> > > > > >
> > > > >> > > > > > Apologies for the big cross post, but I thought this
> might
> > > be
> > > > >> > > something
> > > > >> > > > > all
> > > > >> > > > > > three communities would find relevant.
> > > > >> > > > > >
> > > > >> > > > > > Beam is finally making progress on a SQL DSL utilizing
> > > > Calcite,
> > > > >> > > thanks
> > > > >> > > > to
> > > > >> > > > > > Mingmin Xu. As you can imagine, we need to come to some
> > > > >> conclusion
> > > > >> > > > about
> > > > >> > > > > > how to elegantly support the full suite of streaming
> > > > >> functionality
> > > > >> > in
> > > > >> > > > the
> > > > >> > > > > > Beam model in via Calcite SQL. You folks in the Flink
> > > > community
> > > > >> > have
> > > > >> > > > been
> > > > >> > > > > > pushing on this (e.g., adding windowing constructs,
> > amongst
> > > > >> others,
> > > > >> > > > thank
> > > > >> > > > > > you! :-), but from my understanding we still don't have
> a
> > > full
> > > > >> spec
> > > > >> > > for
> > > > >> > > > > how
> > > > >> > > > > > to support robust streaming in SQL (including but not
> > > limited
> > > > >> to,
> > > > >> > > > e.g., a
> > > > >> > > > > > triggers analogue such as EMIT).
> > > > >> > > > > >
> > > > >> > > > > > I've been spending a lot of time thinking about this and
> > > have
> > > > >> some
> > > > >> > > > > opinions
> > > > >> > > > > > about how I think it should look that I've already
> written
> > > > down,
> > > > >> > so I
> > > > >> > > > > > volunteered to try to drive forward agreement on a
> general
> > > > >> > streaming
> > > > >> > > > SQL
> > > > >> > > > > > spec between our three communities (well, technically I
> > > > >> volunteered
> > > > >> > > to
> > > > >> > > > do
> > > > >> > > > > > that w/ Beam and Calcite, but I figured you Flink folks
> > > might
> > > > >> want
> > > > >> > to
> > > > >> > > > > join
> > > > >> > > > > > in since you're going that direction already anyway and
> > will
> > > > >> have
> > > > >> > > > useful
> > > > >> > > > > > insights :-).
> > > > >> > > > > >
> > > > >> > > > > > My plan was to do this by sharing two docs:
> > > > >> > > > > >
> > > > >> > > > > >    1. The Beam Model : Streams & Tables - This one is
> for
> > > > >> context,
> > > > >> > > and

Re: Towards a spec for robust streaming SQL, Part 1

Posted by Shaoxuan Wang <ws...@gmail.com>.
Thanks to Tyler and Fabian for sharing your thoughts.

Regarding the early/late update control in Flink: IMO, each dynamic
table can have an EMIT config. For the Flink Table API, this can be easily
implemented in different manners, case by case. For instance, in a window
aggregate, we could define "when to EMIT a result" via a window config per
window when we create the windows. Unfortunately we do not have such
flexibility in a SQL query (as compared with the Table API), so we need to
find a way to embed this EMIT config.
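
One hypothetical shape such an embedded config could take is an EMIT clause
on the query itself (illustrative only; this syntax is not implemented in
Calcite, Flink, or Beam, and the table/column names are made up):

```sql
-- Hypothetical: control when the windowed aggregate emits results.
SELECT user, COUNT(*) AS cnt
FROM Clicks
GROUP BY user, TUMBLE(rowtime, INTERVAL '1' HOUR)
EMIT AFTER WATERMARK  -- hypothetical clause: emit once the watermark
                      -- passes the end of the window
```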

Regards,
Shaoxuan



Re: Towards a spec for robust streaming SQL, Part 1

Posted by Fabian Hueske <fh...@gmail.com>.
2017-05-11 7:14 GMT+02:00 Tyler Akidau <ta...@google.com.invalid>:

> On Tue, May 9, 2017 at 3:06 PM Fabian Hueske <fh...@gmail.com> wrote:
>
> > Hi Tyler,
> >
> > thank you very much for this excellent write-up and the super nice
> > visualizations!
> > You are discussing a lot of the things that we have been thinking about
> as
> > well from a different perspective.
> > IMO, yours and the Flink model are pretty much aligned although we use a
> > different terminology (which is not yet completely established). So there
> > should be room for unification ;-)
> >
>
> Good to hear, thanks for giving it a look. :-)
>
>
> > Allow me a few words on the current state in Flink. In the upcoming 1.3.0
> > release, we will have support for group window (TUMBLE, HOP, SESSION),
> OVER
> > RANGE/ROW window (without FOLLOWING), and non-windowed GROUP BY
> > aggregations. The group windows are triggered by watermark and the over
> > window and non-windowed aggregations emit for each input record
> > (AtCount(1)). The window aggregations do not support early or late firing
> > (late records are dropped), so no updates here. However, the non-windowed
> > aggregates produce updates (in acc and acc/retract mode). Based on this
> we
> > will work on better control for late updates and early firing as well as
> > joins in the next release.
> >
>
> Impressive. At this rate there's a good chance we'll just be doing catchup
> and thanking you for building everything. ;-) Do you have ideas for what
> you want your early/late updates control to look like? That's one of the
> areas I'd like to see better defined for us. And how deep are you going
> with joins?
>

Right now (well, actually I merged the change an hour ago) we are using a
QueryConfig object to specify state retention intervals, so that we can
clean up the state of inactive keys.
Running a query looks like this:

// -------
val env = StreamExecutionEnvironment.getExecutionEnvironment
val tEnv = TableEnvironment.getTableEnvironment(env)

// state of inactive keys is kept for 12 hours
val qConf = tEnv.queryConfig.withIdleStateRetentionTime(Time.hours(12))

val t: Table = tEnv.sql("SELECT a, COUNT(*) AS cnt FROM MyTable GROUP BY a")
// boolean flag for acc/retract
val stream: DataStream[(Boolean, Row)] = t.toRetractStream[Row](qConf)

env.execute()
// -------

We plan to use the QueryConfig to specify early/late updates as well. Our
main motivation is to have uniform, standard SQL for batch and streaming;
hence, we have to move the configuration out of the query. But I agree that
it would be very nice to be able to include it in the query. I think it
should not be difficult to optionally support an EMIT clause as well.
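
To make that concrete, update control on the QueryConfig might look
something like the following sketch. This is purely illustrative:
`withEarlyFire` and `withLateFireOnEveryRecord` are hypothetical method
names, not part of any Flink API.

```scala
// Sketch only: hypothetical QueryConfig methods for early/late update control.
val qConf = tEnv.queryConfig
  .withIdleStateRetentionTime(Time.hours(12))
  .withEarlyFire(Time.minutes(1))  // hypothetical: periodic early updates
  .withLateFireOnEveryRecord()     // hypothetical: one update per late record
```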


>
> > Reading the document, I did not find any major difference in our
> concepts.
> > In fact, we are aiming to support the cases you are describing as well.
> > I have a question though. Would you classify an OVER aggregation as a
> > stream -> stream or stream -> table operation? It collects records to
> > aggregate them, but emits one record for each input row. Depending on the
> > window definition (i.e., with FOLLOWING CURRENT ROW), you can compute and
> > emit the result record when the input record is received.
> >
>
> I would call it a composite stream → stream operation (since SQL, like the
> standard Beam/Flink APIs, is a higher level set of constructs than raw
> streams/tables operations) consisting of a stream → table windowed grouping
> followed by a table → stream triggering on every element, basically as you
> described in the previous paragraph.
>
>
That makes sense. Thanks :-)
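
As a concrete illustration of the decomposition described above (standard
SQL; the table and column names `Clicks`, `user`, and `rowtime` are made up
for the example):

```sql
-- One output row per input row: a running per-key count.
-- Stream -> table: rows accumulate per-user state;
-- table -> stream: an every-element trigger emits the updated count.
SELECT
  user,
  rowtime,
  COUNT(*) OVER (
    PARTITION BY user
    ORDER BY rowtime
    RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
  ) AS clicks_so_far
FROM Clicks
```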



> > > >> aggregation,
> > > >> > > > window merging etc all just act upon the merged stream and
> > > generate
> > > >> > what
> > > >> > > is
> > > >> > > > effectively a merged table.
> > > >> > > >
> > > >> > > > -Tyler
> > > >> > > >
> > > >> > > > On Fri, Apr 21, 2017 at 12:36 PM Lukasz Cwik
> > > >> <lcwik@google.com.invalid
> > > >> > >
> > > >> > > > wrote:
> > > >> > > >
> > > >> > > > > The doc is a good read.
> > > >> > > > >
> > > >> > > > > I think you do a great job of explaining table -> stream,
> > stream
> > > >> ->
> > > >> > > > stream,
> > > >> > > > > and stream -> table when there is only one stream.
> > > >> > > > > But when there are multiple streams reading/writing to a
> > table,
> > > >> how
> > > >> > > does
> > > >> > > > > that impact what occurs?
> > > >> > > > > For example, with CoGBK you have multiple streams writing
> to a
> > > >> table,
> > > >> > > how
> > > >> > > > > does that impact window merging?
> > > >> > > > >
> > > >> > > > > On Thu, Apr 20, 2017 at 5:57 PM, Tyler Akidau
> > > >> > > <takidau@google.com.invalid
> > > >> > > > >
> > > >> > > > > wrote:
> > > >> > > > >
> > > >> > > > > > Hello Beam, Calcite, and Flink dev lists!
> > > >> > > > > >
> > > >> > > > > > Apologies for the big cross post, but I thought this might
> > be
> > > >> > > something
> > > >> > > > > all
> > > >> > > > > > three communities would find relevant.
> > > >> > > > > >
> > > >> > > > > > Beam is finally making progress on a SQL DSL utilizing
> > > Calcite,
> > > >> > > thanks
> > > >> > > > to
> > > >> > > > > > Mingmin Xu. As you can imagine, we need to come to some
> > > >> conclusion
> > > >> > > > about
> > > >> > > > > > how to elegantly support the full suite of streaming
> > > >> functionality
> > > >> > in
> > > >> > > > the
> > > >> > > > > > Beam model in via Calcite SQL. You folks in the Flink
> > > community
> > > >> > have
> > > >> > > > been
> > > >> > > > > > pushing on this (e.g., adding windowing constructs,
> amongst
> > > >> others,
> > > >> > > > thank
> > > >> > > > > > you! :-), but from my understanding we still don't have a
> > full
> > > >> spec
> > > >> > > for
> > > >> > > > > how
> > > >> > > > > > to support robust streaming in SQL (including but not
> > limited
> > > >> to,
> > > >> > > > e.g., a
> > > >> > > > > > triggers analogue such as EMIT).
> > > >> > > > > >
> > > >> > > > > > I've been spending a lot of time thinking about this and
> > have
> > > >> some
> > > >> > > > > opinions
> > > >> > > > > > about how I think it should look that I've already written
> > > down,
> > > >> > so I
> > > >> > > > > > volunteered to try to drive forward agreement on a general
> > > >> > streaming
> > > >> > > > SQL
> > > >> > > > > > spec between our three communities (well, technically I
> > > >> volunteered
> > > >> > > to
> > > >> > > > do
> > > >> > > > > > that w/ Beam and Calcite, but I figured you Flink folks
> > might
> > > >> want
> > > >> > to
> > > >> > > > > join
> > > >> > > > > > in since you're going that direction already anyway and
> will
> > > >> have
> > > >> > > > useful
> > > >> > > > > > insights :-).
> > > >> > > > > >
> > > >> > > > > > My plan was to do this by sharing two docs:
> > > >> > > > > >
> > > >> > > > > >    1. The Beam Model : Streams & Tables - This one is for
> > > >> context,
> > > >> > > and
> > > >> > > > > >    really only mentions SQL in passing. But it describes
> the
> > > >> > > > relationship
> > > >> > > > > >    between the Beam Model and the "streams & tables" way
> of
> > > >> > thinking,
> > > >> > > > > which
> > > >> > > > > >    turns out to be useful in understanding what robust
> > > >> streaming in
> > > >> > > SQL
> > > >> > > > > > might
> > > >> > > > > >    look like. Many of you probably already know some or
> all
> > of
> > > >> > what's
> > > >> > > > in
> > > >> > > > > > here,
> > > >> > > > > >    but I felt it was necessary to have it all written down
> > in
> > > >> order
> > > >> > > to
> > > >> > > > > > justify
> > > >> > > > > >    some of the proposals I wanted to make in the second
> doc.
> > > >> > > > > >
> > > >> > > > > >    2. A streaming SQL spec for Calcite - The goal for this
> > doc
> > > >> is
> > > >> > > that
> > > >> > > > it
> > > >> > > > > >    would become a general specification for what robust
> > > >> streaming
> > > >> > SQL
> > > >> > > > in
> > > >> > > > > >    Calcite should look like. It would start out as a basic
> > > >> proposal
> > > >> > > of
> > > >> > > > > what
> > > >> > > > > >    things *could* look like (combining both what things
> look
> > > >> like
> > > >> > now
> > > >> > > > as
> > > >> > > > > > well
> > > >> > > > > >    as a set of proposed changes for the future), and we
> > could
> > > >> all
> > > >> > > > iterate
> > > >> > > > > > on
> > > >> > > > > >    it together until we get to something we're happy with.
> > > >> > > > > >
> > > >> > > > > > At this point, I have doc #1 ready, and it's a bit of a
> > > monster,
> > > >> > so I
> > > >> > > > > > figured I'd share it and let folks hack at it with
> comments
> > if
> > > >> they
> > > >> > > > have
> > > >> > > > > > any, while I try to get the second doc ready in the
> > meantime.
> > > As
> > > >> > part
> > > >> > > > of
> > > >> > > > > > getting doc #2 ready, I'll be starting a separate thread
> to
> > > try
> > > >> to
> > > >> > > > gather
> > > >> > > > > > input on what things are already in flight for streaming
> SQL
> > > >> across
> > > >> > > the
> > > >> > > > > > various communities, to make sure the proposal captures
> > > >> everything
> > > >> > > > that's
> > > >> > > > > > going on as accurately as it can.
> > > >> > > > > >
> > > >> > > > > > If you have any questions or comments, I'm interested to
> > hear
> > > >> them.
> > > >> > > > > > Otherwise, here's doc #1, "The Beam Model : Streams &
> > Tables":
> > > >> > > > > >
> > > >> > > > > >   http://s.apache.org/beam-streams-tables
> > > >> > > > > >
> > > >> > > > > > -Tyler
> > > >> > > > > >
> > > >> > > > >
> > > >> > > >
> > > >> > >
> > > >> >
> > > >>
> > > >
> > >
> >
>

Re: Towards a spec for robust streaming SQL, Part 1

Posted by Tyler Akidau <ta...@google.com.INVALID>.
On Tue, May 9, 2017 at 3:06 PM Fabian Hueske <fh...@gmail.com> wrote:

> Hi Tyler,
>
> thank you very much for this excellent write-up and the super nice
> visualizations!
> You are discussing a lot of the things that we have been thinking about as
> well from a different perspective.
> IMO, yours and the Flink model are pretty much aligned although we use a
> different terminology (which is not yet completely established). So there
> should be room for unification ;-)
>

Good to hear, thanks for giving it a look. :-)


> Allow me a few words on the current state in Flink. In the upcoming 1.3.0
> release, we will have support for group window (TUMBLE, HOP, SESSION), OVER
> RANGE/ROW window (without FOLLOWING), and non-windowed GROUP BY
> aggregations. The group windows are triggered by watermark and the over
> window and non-windowed aggregations emit for each input record
> (AtCount(1)). The window aggregations do not support early or late firing
> (late records are dropped), so no updates here. However, the non-windowed
> aggregates produce updates (in acc and acc/retract mode). Based on this we
> will work on better control for late updates and early firing as well as
> joins in the next release.
>

Impressive. At this rate there's a good chance we'll just be doing catchup
and thanking you for building everything. ;-) Do you have ideas for what
you want your early/late updates control to look like? That's one of the
areas I'd like to see better defined for us. And how deep are you going
with joins?


> Reading the document, I did not find any major difference in our concepts.
> In fact, we are aiming to support the cases you are describing as well.
> I have a question though. Would you classify an OVER aggregation as a
> stream -> stream or stream -> table operation? It collects records to
> aggregate them, but emits one record for each input row. Depending on the
> window definition (i.e., with FOLLOWING CURRENT ROW), you can compute and
> emit the result record when the input record is received.
>

I would call it a composite stream → stream operation (since SQL, like the
standard Beam/Flink APIs, is a higher level set of constructs than raw
streams/tables operations) consisting of a stream → table windowed grouping
followed by a table → stream triggering on every element, basically as you
described in the previous paragraph.
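
That composite view can be sketched with a toy streams-and-tables model (plain Python; hypothetical names, not the Beam or Flink API): each input element updates a per-key table of running aggregates (stream → table), and a per-element trigger immediately emits the updated row (table → stream), yielding one output record per input record, as an OVER aggregation does.

```python
from collections import defaultdict

def over_count(stream):
    """Toy model of COUNT(*) OVER (PARTITION BY key ROWS UNBOUNDED PRECEDING)
    as a composite stream -> table -> stream operation."""
    table = defaultdict(int)  # stream -> table: running aggregate per key
    out = []
    for key, value in stream:
        table[key] += 1                       # each input record updates the table...
        out.append((key, value, table[key]))  # ...and a per-element trigger emits
    return out

print(over_count([("a", 10), ("b", 20), ("a", 30)]))
# one output row per input row: [('a', 10, 1), ('b', 20, 1), ('a', 30, 2)]
```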

-Tyler


>
> I'm looking forward to the second part.
>
> Cheers, Fabian
>
>
>
> 2017-05-09 0:34 GMT+02:00 Tyler Akidau <ta...@google.com.invalid>:
>
> > Any thoughts here Fabian? I'm planning to start sending out some more
> > emails towards the end of the week.
> >
> > -Tyler
> >
> >
> > On Wed, Apr 26, 2017 at 8:18 AM Tyler Akidau <ta...@google.com> wrote:
> >
> > > No worries, thanks for the heads up. Good luck wrapping all that stuff
> > up.
> > >
> > > -Tyler
> > >
> > > On Tue, Apr 25, 2017 at 12:07 AM Fabian Hueske <fh...@gmail.com>
> > wrote:
> > >
> > >> Hi Tyler,
> > >>
> > >> thanks for pushing this effort and including the Flink list.
> > >> I haven't managed to read the doc yet, but just wanted to thank you
> for
> > >> the
> > >> write-up and let you know that I'm very interested in this discussion.
> > >>
> > >> We are very close to the feature freeze of Flink 1.3 and I'm quite
> busy
> > >> getting as many contributions merged before the release is forked off.
> > >> When that happened, I'll have more time to read and comment.
> > >>
> > >> Thanks,
> > >> Fabian
> > >>
> > >>
> > >> 2017-04-22 0:16 GMT+02:00 Tyler Akidau <ta...@google.com.invalid>:
> > >>
> > >> > Good point, when you start talking about anything less than a full
> > join,
> > >> > triggers get involved to describe how one actually achieves the
> > desired
> > >> > semantics, and they may end up being tied to just one of the inputs
> > >> (e.g.,
> > >> > you may only care about the watermark for one side of the join). Am
> > >> > expecting us to address these sorts of details more precisely in doc
> > #2.
> > >> >
> > >> > -Tyler
> > >> >
> > >> > On Fri, Apr 21, 2017 at 2:26 PM Kenneth Knowles
> > <klk@google.com.invalid
> > >> >
> > >> > wrote:
> > >> >
> > >> > > There's something to be said about having different triggering
> > >> depending
> > >> > on
> > >> > > which side of a join data comes from, perhaps?
> > >> > >
> > >> > > (delightful doc, as usual)
> > >> > >
> > >> > > Kenn
> > >> > >
> > >> > > On Fri, Apr 21, 2017 at 1:33 PM, Tyler Akidau
> > >> <takidau@google.com.invalid
> > >> > >
> > >> > > wrote:
> > >> > >
> > >> > > > Thanks for reading, Luke. The simple answer is that CoGBK is
> > >> basically
> > >> > > > flatten + GBK. Flatten is a non-grouping operation that merges
> the
> > >> > input
> > >> > > > streams into a single output stream. GBK then groups the data
> > within
> > >> > that
> > >> > > > single union stream as you might otherwise expect, yielding a
> > single
> > >> > > table.
> > >> > > > So I think it doesn't really impact things much. Grouping,
> > >> aggregation,
> > >> > > > window merging etc all just act upon the merged stream and
> > generate
> > >> > what
> > >> > > is
> > >> > > > effectively a merged table.
> > >> > > >
> > >> > > > -Tyler
> > >> > > >
> > >> > > > On Fri, Apr 21, 2017 at 12:36 PM Lukasz Cwik
> > >> <lcwik@google.com.invalid
> > >> > >
> > >> > > > wrote:
> > >> > > >
> > >> > > > > The doc is a good read.
> > >> > > > >
> > >> > > > > I think you do a great job of explaining table -> stream,
> stream
> > >> ->
> > >> > > > stream,
> > >> > > > > and stream -> table when there is only one stream.
> > >> > > > > But when there are multiple streams reading/writing to a
> table,
> > >> how
> > >> > > does
> > >> > > > > that impact what occurs?
> > >> > > > > For example, with CoGBK you have multiple streams writing to a
> > >> table,
> > >> > > how
> > >> > > > > does that impact window merging?
> > >> > > > >
> > >> > > > > On Thu, Apr 20, 2017 at 5:57 PM, Tyler Akidau
> > >> > > <takidau@google.com.invalid
> > >> > > > >
> > >> > > > > wrote:
> > >> > > > >
> > >> > > > > > Hello Beam, Calcite, and Flink dev lists!
> > >> > > > > >
> > >> > > > > > Apologies for the big cross post, but I thought this might
> be
> > >> > > something
> > >> > > > > all
> > >> > > > > > three communities would find relevant.
> > >> > > > > >
> > >> > > > > > Beam is finally making progress on a SQL DSL utilizing
> > Calcite,
> > >> > > thanks
> > >> > > > to
> > >> > > > > > Mingmin Xu. As you can imagine, we need to come to some
> > >> conclusion
> > >> > > > about
> > >> > > > > > how to elegantly support the full suite of streaming
> > >> functionality
> > >> > in
> > >> > > > the
> > >> > > > > > Beam model in via Calcite SQL. You folks in the Flink
> > community
> > >> > have
> > >> > > > been
> > >> > > > > > pushing on this (e.g., adding windowing constructs, amongst
> > >> others,
> > >> > > > thank
> > >> > > > > > you! :-), but from my understanding we still don't have a
> full
> > >> spec
> > >> > > for
> > >> > > > > how
> > >> > > > > > to support robust streaming in SQL (including but not
> limited
> > >> to,
> > >> > > > e.g., a
> > >> > > > > > triggers analogue such as EMIT).
> > >> > > > > >
> > >> > > > > > I've been spending a lot of time thinking about this and
> have
> > >> some
> > >> > > > > opinions
> > >> > > > > > about how I think it should look that I've already written
> > down,
> > >> > so I
> > >> > > > > > volunteered to try to drive forward agreement on a general
> > >> > streaming
> > >> > > > SQL
> > >> > > > > > spec between our three communities (well, technically I
> > >> volunteered
> > >> > > to
> > >> > > > do
> > >> > > > > > that w/ Beam and Calcite, but I figured you Flink folks
> might
> > >> want
> > >> > to
> > >> > > > > join
> > >> > > > > > in since you're going that direction already anyway and will
> > >> have
> > >> > > > useful
> > >> > > > > > insights :-).
> > >> > > > > >
> > >> > > > > > My plan was to do this by sharing two docs:
> > >> > > > > >
> > >> > > > > >    1. The Beam Model : Streams & Tables - This one is for
> > >> context,
> > >> > > and
> > >> > > > > >    really only mentions SQL in passing. But it describes the
> > >> > > > relationship
> > >> > > > > >    between the Beam Model and the "streams & tables" way of
> > >> > thinking,
> > >> > > > > which
> > >> > > > > >    turns out to be useful in understanding what robust
> > >> streaming in
> > >> > > SQL
> > >> > > > > > might
> > >> > > > > >    look like. Many of you probably already know some or all
> of
> > >> > what's
> > >> > > > in
> > >> > > > > > here,
> > >> > > > > >    but I felt it was necessary to have it all written down
> in
> > >> order
> > >> > > to
> > >> > > > > > justify
> > >> > > > > >    some of the proposals I wanted to make in the second doc.
> > >> > > > > >
> > >> > > > > >    2. A streaming SQL spec for Calcite - The goal for this
> doc
> > >> is
> > >> > > that
> > >> > > > it
> > >> > > > > >    would become a general specification for what robust
> > >> streaming
> > >> > SQL
> > >> > > > in
> > >> > > > > >    Calcite should look like. It would start out as a basic
> > >> proposal
> > >> > > of
> > >> > > > > what
> > >> > > > > >    things *could* look like (combining both what things look
> > >> like
> > >> > now
> > >> > > > as
> > >> > > > > > well
> > >> > > > > >    as a set of proposed changes for the future), and we
> could
> > >> all
> > >> > > > iterate
> > >> > > > > > on
> > >> > > > > >    it together until we get to something we're happy with.
> > >> > > > > >
> > >> > > > > > At this point, I have doc #1 ready, and it's a bit of a
> > monster,
> > >> > so I
> > >> > > > > > figured I'd share it and let folks hack at it with comments
> if
> > >> they
> > >> > > > have
> > >> > > > > > any, while I try to get the second doc ready in the
> meantime.
> > As
> > >> > part
> > >> > > > of
> > >> > > > > > getting doc #2 ready, I'll be starting a separate thread to
> > try
> > >> to
> > >> > > > gather
> > >> > > > > > input on what things are already in flight for streaming SQL
> > >> across
> > >> > > the
> > >> > > > > > various communities, to make sure the proposal captures
> > >> everything
> > >> > > > that's
> > >> > > > > > going on as accurately as it can.
> > >> > > > > >
> > >> > > > > > If you have any questions or comments, I'm interested to
> hear
> > >> them.
> > >> > > > > > Otherwise, here's doc #1, "The Beam Model : Streams &
> Tables":
> > >> > > > > >
> > >> > > > > >   http://s.apache.org/beam-streams-tables
> > >> > > > > >
> > >> > > > > > -Tyler
> > >> > > > > >
> > >> > > > >
> > >> > > >
> > >> > >
> > >> >
> > >>
> > >
> >
>

Re: Towards a spec for robust streaming SQL, Part 1

Posted by Tyler Akidau <ta...@google.com.INVALID>.
On Tue, May 9, 2017 at 3:06 PM Fabian Hueske <fh...@gmail.com> wrote:

> Hi Tyler,
>
> thank you very much for this excellent write-up and the super nice
> visualizations!
> You are discussing a lot of the things that we have been thinking about as
> well from a different perspective.
> IMO, yours and the Flink model are pretty much aligned although we use a
> different terminology (which is not yet completely established). So there
> should be room for unification ;-)
>

Good to hear, thanks for giving it a look. :-)


> Allow me a few words on the current state in Flink. In the upcoming 1.3.0
> release, we will have support for group windows (TUMBLE, HOP, SESSION), OVER
> RANGE/ROW windows (without FOLLOWING), and non-windowed GROUP BY
> aggregations. The group windows are triggered by the watermark, and the OVER
> windows and non-windowed aggregations emit for each input record
> (AtCount(1)). The window aggregations do not support early or late firing
> (late records are dropped), so no updates here. However, the non-windowed
> aggregates produce updates (in acc and acc/retract mode). Based on this we
> will work on better control for late updates and early firing as well as
> joins in the next release.
>

Impressive. At this rate there's a good chance we'll just be doing catchup
and thanking you for building everything. ;-) Do you have ideas for what
you want your early/late updates control to look like? That's one of the
areas I'd like to see better defined for us. And how deep are you going
with joins?


> Reading the document, I did not find any major difference in our concepts.
> In fact, we are aiming to support the cases you are describing as well.
> I have a question though. Would you classify an OVER aggregation as a
> stream -> stream or stream -> table operation? It collects records to
> aggregate them, but emits one record for each input row. Depending on the
> window definition (i.e., with FOLLOWING CURRENT ROW), you can compute and
> emit the result record when the input record is received.
>

I would call it a composite stream → stream operation (since SQL, like the
standard Beam/Flink APIs, is a higher-level set of constructs than raw
streams/tables operations) consisting of a stream → table windowed grouping
followed by a table → stream triggering on every element, basically as you
described in the previous paragraph.
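
To make that decomposition concrete, here is a toy Python sketch (not Beam or
Flink code; all names are made up) of an unbounded OVER-style running sum: each
input record is applied to an evolving per-key table (stream → table), and a
trigger firing on every element immediately reads the updated value back out
(table → stream), yielding exactly one output record per input record:

```python
from collections import defaultdict

def over_sum(stream):
    """For each (key, value) record, emit (key, running_sum) immediately.

    stream -> table: 'table' accumulates per-key state.
    table -> stream: triggering on every element emits one output per input.
    """
    table = defaultdict(int)          # the evolving table (per-key state)
    for key, value in stream:
        table[key] += value           # stream -> table: apply the record
        yield (key, table[key])       # table -> stream: AtCount(1) trigger

# One output record per input record, as in OVER ... ROWS ... CURRENT ROW:
print(list(over_sum([("a", 1), ("a", 2), ("b", 5), ("a", 3)])))
# [('a', 1), ('a', 3), ('b', 5), ('a', 6)]
```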

-Tyler


>
> I'm looking forward to the second part.
>
> Cheers, Fabian
>
>
>
> 2017-05-09 0:34 GMT+02:00 Tyler Akidau <ta...@google.com.invalid>:
>
> > Any thoughts here Fabian? I'm planning to start sending out some more
> > emails towards the end of the week.
> >
> > -Tyler
> >
> >
> > On Wed, Apr 26, 2017 at 8:18 AM Tyler Akidau <ta...@google.com> wrote:
> >
> > > No worries, thanks for the heads up. Good luck wrapping all that stuff
> > up.
> > >
> > > -Tyler
> > >
> > > On Tue, Apr 25, 2017 at 12:07 AM Fabian Hueske <fh...@gmail.com>
> > wrote:
> > >
> > >> Hi Tyler,
> > >>
> > >> thanks for pushing this effort and including the Flink list.
> > >> I haven't managed to read the doc yet, but just wanted to thank you
> for
> > >> the
> > >> write-up and let you know that I'm very interested in this discussion.
> > >>
> > >> We are very close to the feature freeze of Flink 1.3 and I'm quite
> busy
> > >> getting as many contributions merged before the release is forked off.
> > >> Once that has happened, I'll have more time to read and comment.
> > >>
> > >> Thanks,
> > >> Fabian
> > >>
> > >>
> > >> 2017-04-22 0:16 GMT+02:00 Tyler Akidau <ta...@google.com.invalid>:
> > >>
> > >> > Good point, when you start talking about anything less than a full
> > join,
> > >> > triggers get involved to describe how one actually achieves the
> > desired
> > >> > semantics, and they may end up being tied to just one of the inputs
> > >> (e.g.,
> > >> > you may only care about the watermark for one side of the join). Am
> > >> > expecting us to address these sorts of details more precisely in doc
> > #2.
> > >> >
> > >> > -Tyler
> > >> >
> > >> > On Fri, Apr 21, 2017 at 2:26 PM Kenneth Knowles
> > <klk@google.com.invalid
> > >> >
> > >> > wrote:
> > >> >
> > >> > > There's something to be said about having different triggering
> > >> depending
> > >> > on
> > >> > > which side of a join data comes from, perhaps?
> > >> > >
> > >> > > (delightful doc, as usual)
> > >> > >
> > >> > > Kenn
> > >> > >
> > >> > > On Fri, Apr 21, 2017 at 1:33 PM, Tyler Akidau
> > >> <takidau@google.com.invalid
> > >> > >
> > >> > > wrote:
> > >> > >
> > >> > > > Thanks for reading, Luke. The simple answer is that CoGBK is
> > >> basically
> > >> > > > flatten + GBK. Flatten is a non-grouping operation that merges
> the
> > >> > input
> > >> > > > streams into a single output stream. GBK then groups the data
> > within
> > >> > that
> > >> > > > single union stream as you might otherwise expect, yielding a
> > single
> > >> > > table.
> > >> > > > So I think it doesn't really impact things much. Grouping,
> > >> aggregation,
> > >> > > > window merging etc all just act upon the merged stream and
> > generate
> > >> > what
> > >> > > is
> > >> > > > effectively a merged table.
> > >> > > >
> > >> > > > -Tyler
> > >> > > >
> > >> > > > On Fri, Apr 21, 2017 at 12:36 PM Lukasz Cwik
> > >> <lcwik@google.com.invalid
> > >> > >
> > >> > > > wrote:
> > >> > > >
> > >> > > > > The doc is a good read.
> > >> > > > >
> > >> > > > > I think you do a great job of explaining table -> stream,
> stream
> > >> ->
> > >> > > > stream,
> > >> > > > > and stream -> table when there is only one stream.
> > >> > > > > But when there are multiple streams reading/writing to a
> table,
> > >> how
> > >> > > does
> > >> > > > > that impact what occurs?
> > >> > > > > For example, with CoGBK you have multiple streams writing to a
> > >> table,
> > >> > > how
> > >> > > > > does that impact window merging?
> > >> > > > >
> > >> > > > > On Thu, Apr 20, 2017 at 5:57 PM, Tyler Akidau
> > >> > > <takidau@google.com.invalid
> > >> > > > >
> > >> > > > > wrote:
> > >> > > > >
> > >> > > > > > Hello Beam, Calcite, and Flink dev lists!
> > >> > > > > >
> > >> > > > > > Apologies for the big cross post, but I thought this might
> be
> > >> > > something
> > >> > > > > all
> > >> > > > > > three communities would find relevant.
> > >> > > > > >
> > >> > > > > > Beam is finally making progress on a SQL DSL utilizing
> > Calcite,
> > >> > > thanks
> > >> > > > to
> > >> > > > > > Mingmin Xu. As you can imagine, we need to come to some
> > >> conclusion
> > >> > > > about
> > >> > > > > > how to elegantly support the full suite of streaming
> > >> functionality
> > >> > in
> > >> > > > the
> > >> > > > > > Beam model via Calcite SQL. You folks in the Flink
> > community
> > >> > have
> > >> > > > been
> > >> > > > > > pushing on this (e.g., adding windowing constructs, amongst
> > >> others,
> > >> > > > thank
> > >> > > > > > you! :-), but from my understanding we still don't have a
> full
> > >> spec
> > >> > > for
> > >> > > > > how
> > >> > > > > > to support robust streaming in SQL (including but not
> limited
> > >> to,
> > >> > > > e.g., a
> > >> > > > > > triggers analogue such as EMIT).
> > >> > > > > >
> > >> > > > > > I've been spending a lot of time thinking about this and
> have
> > >> some
> > >> > > > > opinions
> > >> > > > > > about how I think it should look that I've already written
> > down,
> > >> > so I
> > >> > > > > > volunteered to try to drive forward agreement on a general
> > >> > streaming
> > >> > > > SQL
> > >> > > > > > spec between our three communities (well, technically I
> > >> volunteered
> > >> > > to
> > >> > > > do
> > >> > > > > > that w/ Beam and Calcite, but I figured you Flink folks
> might
> > >> want
> > >> > to
> > >> > > > > join
> > >> > > > > > in since you're going that direction already anyway and will
> > >> have
> > >> > > > useful
> > >> > > > > > insights :-).
> > >> > > > > >
> > >> > > > > > My plan was to do this by sharing two docs:
> > >> > > > > >
> > >> > > > > >    1. The Beam Model : Streams & Tables - This one is for
> > >> context,
> > >> > > and
> > >> > > > > >    really only mentions SQL in passing. But it describes the
> > >> > > > relationship
> > >> > > > > >    between the Beam Model and the "streams & tables" way of
> > >> > thinking,
> > >> > > > > which
> > >> > > > > >    turns out to be useful in understanding what robust
> > >> streaming in
> > >> > > SQL
> > >> > > > > > might
> > >> > > > > >    look like. Many of you probably already know some or all
> of
> > >> > what's
> > >> > > > in
> > >> > > > > > here,
> > >> > > > > >    but I felt it was necessary to have it all written down
> in
> > >> order
> > >> > > to
> > >> > > > > > justify
> > >> > > > > >    some of the proposals I wanted to make in the second doc.
> > >> > > > > >
> > >> > > > > >    2. A streaming SQL spec for Calcite - The goal for this
> doc
> > >> is
> > >> > > that
> > >> > > > it
> > >> > > > > >    would become a general specification for what robust
> > >> streaming
> > >> > SQL
> > >> > > > in
> > >> > > > > >    Calcite should look like. It would start out as a basic
> > >> proposal
> > >> > > of
> > >> > > > > what
> > >> > > > > >    things *could* look like (combining both what things look
> > >> like
> > >> > now
> > >> > > > as
> > >> > > > > > well
> > >> > > > > >    as a set of proposed changes for the future), and we
> could
> > >> all
> > >> > > > iterate
> > >> > > > > > on
> > >> > > > > >    it together until we get to something we're happy with.
> > >> > > > > >
> > >> > > > > > At this point, I have doc #1 ready, and it's a bit of a
> > monster,
> > >> > so I
> > >> > > > > > figured I'd share it and let folks hack at it with comments
> if
> > >> they
> > >> > > > have
> > >> > > > > > any, while I try to get the second doc ready in the
> meantime.
> > As
> > >> > part
> > >> > > > of
> > >> > > > > > getting doc #2 ready, I'll be starting a separate thread to
> > try
> > >> to
> > >> > > > gather
> > >> > > > > > input on what things are already in flight for streaming SQL
> > >> across
> > >> > > the
> > >> > > > > > various communities, to make sure the proposal captures
> > >> everything
> > >> > > > that's
> > >> > > > > > going on as accurately as it can.
> > >> > > > > >
> > >> > > > > > If you have any questions or comments, I'm interested to
> hear
> > >> them.
> > >> > > > > > Otherwise, here's doc #1, "The Beam Model : Streams &
> Tables":
> > >> > > > > >
> > >> > > > > >   http://s.apache.org/beam-streams-tables
> > >> > > > > >
> > >> > > > > > -Tyler
> > >> > > > > >
> > >> > > > >
> > >> > > >
> > >> > >
> > >> >
> > >>
> > >
> >
>


Re: Towards a spec for robust streaming SQL, Part 1

Posted by Fabian Hueske <fh...@gmail.com>.
Hi Tyler,

thank you very much for this excellent write-up and the super nice
visualizations!
You are discussing a lot of the things that we have been thinking about as
well from a different perspective.
IMO, yours and the Flink model are pretty much aligned although we use a
different terminology (which is not yet completely established). So there
should be room for unification ;-)

Allow me a few words on the current state in Flink. In the upcoming 1.3.0
release, we will have support for group windows (TUMBLE, HOP, SESSION), OVER
RANGE/ROW windows (without FOLLOWING), and non-windowed GROUP BY
aggregations. The group windows are triggered by the watermark, and the OVER
windows and non-windowed aggregations emit for each input record
(AtCount(1)). The window aggregations do not support early or late firing
(late records are dropped), so no updates here. However, the non-windowed
aggregates produce updates (in acc and acc/retract mode). Based on this we
will work on better control for late updates and early firing as well as
joins in the next release.
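
As a rough illustration of the group-window semantics described above, here is
a toy Python sketch (not the Flink runtime; all names are hypothetical): a
tumbling event-time window fires once when the watermark passes its end, and
records arriving behind the watermark are dropped rather than producing
updates:

```python
from collections import defaultdict

def tumble_sum(records, size):
    """records: iterable of (event_time, value) pairs in arrival order,
    interleaved with ('WM', t) watermark markers.
    Yields (window_start, sum) each time the watermark completes a window."""
    windows = defaultdict(int)        # window_start -> running sum (the table)
    watermark = float("-inf")
    for rec in records:
        if rec[0] == "WM":                         # watermark advances:
            watermark = rec[1]
            for start in sorted(w for w in windows if w + size <= watermark):
                yield (start, windows.pop(start))  # fire completed windows once
        else:
            t, v = rec
            if t < watermark:                      # late record: dropped
                continue
            windows[t - t % size] += v             # assign to its tumble window

out = list(tumble_sum(
    [(1, 10), (3, 20), ("WM", 4), (2, 99), (5, 7), ("WM", 8)], size=4))
# Window [0,4) fires at WM=4 with 30; (2, 99) arrives late and is dropped;
# window [4,8) fires at WM=8 with 7.
print(out)  # [(0, 30), (4, 7)]
```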

Reading the document, I did not find any major difference in our concepts.
In fact, we are aiming to support the cases you are describing as well.
I have a question though. Would you classify an OVER aggregation as a
stream -> stream or stream -> table operation? It collects records to
aggregate them, but emits one record for each input row. Depending on the
window definition (i.e., with FOLLOWING CURRENT ROW), you can compute and
emit the result record when the input record is received.

I'm looking forward to the second part.

Cheers, Fabian


Re: Towards a spec for robust streaming SQL, Part 1

Posted by Fabian Hueske <fh...@gmail.com>.
Hi Tyler,

thank you very much for this excellent write-up and the super nice
visualizations!
You are discussing a lot of the things that we have been thinking about as
well from a different perspective.
IMO, yours and the Flink model are pretty much aligned although we use a
different terminology (which is not yet completely established). So there
should be room for unification ;-)

Allow me a few words on the current state in Flink. In the upcoming 1.3.0
release, we will have support for group window (TUMBLE, HOP, SESSION), OVER
RANGE/ROW window (without FOLLOWING), and non-windowed GROUP BY
aggregations. The group windows are triggered by watermark and the over
window and non-windowed aggregations emit for each input record
(AtCount(1)). The window aggregations do not support early or late firing
(late records are dropped), so no updates here. However, the non-windowed
aggregates produce updates (in acc and acc/retract mode). Based on this we
will work on better control for late updates and early firing as well as
joins in the next release.

Reading the document, I did not find any major difference in our concepts.
In fact, we are aiming to support the cases you are describing as well.
I have a question though: would you classify an OVER aggregation as a
stream -> stream or a stream -> table operation? It collects records in
order to aggregate them, but emits one record for each input row.
Depending on the window definition (e.g., a frame that ends at CURRENT
ROW, with no FOLLOWING rows), you can compute and emit the result record
as soon as the input record is received.
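
To illustrate the question, a minimal Python sketch (again purely conceptual,
not Flink or Beam code; the function name is invented) of an OVER-style running
aggregate with a frame ending at the current row. It keeps state the way a
table would, yet emits one output per input, which is what makes the
classification ambiguous:

```python
def over_running_sum(stream):
    """Conceptual OVER aggregation with a frame ending at CURRENT ROW.

    The running total accumulates like table state, but one result is
    emitted per input row, so externally it behaves stream -> stream.
    """
    total = 0
    for value in stream:
        total += value
        yield total  # emit immediately for each input record
```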

I'm looking forward to the second part.

Cheers, Fabian



2017-05-09 0:34 GMT+02:00 Tyler Akidau <ta...@google.com.invalid>:

> Any thoughts here Fabian? I'm planning to start sending out some more
> emails towards the end of the week.
>
> -Tyler
>
>
> On Wed, Apr 26, 2017 at 8:18 AM Tyler Akidau <ta...@google.com> wrote:
>
> > No worries, thanks for the heads up. Good luck wrapping all that stuff
> up.
> >
> > -Tyler
> >
> > On Tue, Apr 25, 2017 at 12:07 AM Fabian Hueske <fh...@gmail.com>
> wrote:
> >
> >> Hi Tyler,
> >>
> >> thanks for pushing this effort and including the Flink list.
> >> I haven't managed to read the doc yet, but just wanted to thank you for
> >> the
> >> write-up and let you know that I'm very interested in this discussion.
> >>
> >> We are very close to the feature freeze of Flink 1.3 and I'm quite busy
> >> getting as many contributions merged before the release is forked off.
> >> When that happened, I'll have more time to read and comment.
> >>
> >> Thanks,
> >> Fabian
> >>
> >>
> >> 2017-04-22 0:16 GMT+02:00 Tyler Akidau <ta...@google.com.invalid>:
> >>
> >> > Good point, when you start talking about anything less than a full
> join,
> >> > triggers get involved to describe how one actually achieves the
> desired
> >> > semantics, and they may end up being tied to just one of the inputs
> >> (e.g.,
> >> > you may only care about the watermark for one side of the join). Am
> >> > expecting us to address these sorts of details more precisely in doc
> #2.
> >> >
> >> > -Tyler
> >> >
> >> > On Fri, Apr 21, 2017 at 2:26 PM Kenneth Knowles
> <klk@google.com.invalid
> >> >
> >> > wrote:
> >> >
> >> > > There's something to be said about having different triggering
> >> depending
> >> > on
> >> > > which side of a join data comes from, perhaps?
> >> > >
> >> > > (delightful doc, as usual)
> >> > >
> >> > > Kenn
> >> > >
> >> > > On Fri, Apr 21, 2017 at 1:33 PM, Tyler Akidau
> >> <takidau@google.com.invalid
> >> > >
> >> > > wrote:
> >> > >
> >> > > > Thanks for reading, Luke. The simple answer is that CoGBK is
> >> basically
> >> > > > flatten + GBK. Flatten is a non-grouping operation that merges the
> >> > input
> >> > > > streams into a single output stream. GBK then groups the data
> within
> >> > that
> >> > > > single union stream as you might otherwise expect, yielding a
> single
> >> > > table.
> >> > > > So I think it doesn't really impact things much. Grouping,
> >> aggregation,
> >> > > > window merging etc all just act upon the merged stream and
> generate
> >> > what
> >> > > is
> >> > > > effectively a merged table.
> >> > > >
> >> > > > -Tyler
> >> > > >
> >> > > > On Fri, Apr 21, 2017 at 12:36 PM Lukasz Cwik
> >> <lcwik@google.com.invalid
> >> > >
> >> > > > wrote:
> >> > > >
> >> > > > > The doc is a good read.
> >> > > > >
> >> > > > > I think you do a great job of explaining table -> stream, stream
> >> ->
> >> > > > stream,
> >> > > > > and stream -> table when there is only one stream.
> >> > > > > But when there are multiple streams reading/writing to a table,
> >> how
> >> > > does
> >> > > > > that impact what occurs?
> >> > > > > For example, with CoGBK you have multiple streams writing to a
> >> table,
> >> > > how
> >> > > > > does that impact window merging?
> >> > > > >
> >> > > > > On Thu, Apr 20, 2017 at 5:57 PM, Tyler Akidau
> >> > > <takidau@google.com.invalid
> >> > > > >
> >> > > > > wrote:
> >> > > > >
> >> > > > > > Hello Beam, Calcite, and Flink dev lists!
> >> > > > > >
> >> > > > > > Apologies for the big cross post, but I thought this might be
> >> > > something
> >> > > > > all
> >> > > > > > three communities would find relevant.
> >> > > > > >
> >> > > > > > Beam is finally making progress on a SQL DSL utilizing
> Calcite,
> >> > > thanks
> >> > > > to
> >> > > > > > Mingmin Xu. As you can imagine, we need to come to some
> >> conclusion
> >> > > > about
> >> > > > > > how to elegantly support the full suite of streaming
> >> functionality
> >> > in
> >> > > > the
> >> > > > > > Beam model in via Calcite SQL. You folks in the Flink
> community
> >> > have
> >> > > > been
> >> > > > > > pushing on this (e.g., adding windowing constructs, amongst
> >> others,
> >> > > > thank
> >> > > > > > you! :-), but from my understanding we still don't have a full
> >> spec
> >> > > for
> >> > > > > how
> >> > > > > > to support robust streaming in SQL (including but not limited
> >> to,
> >> > > > e.g., a
> >> > > > > > triggers analogue such as EMIT).
> >> > > > > >
> >> > > > > > I've been spending a lot of time thinking about this and have
> >> some
> >> > > > > opinions
> >> > > > > > about how I think it should look that I've already written
> down,
> >> > so I
> >> > > > > > volunteered to try to drive forward agreement on a general
> >> > streaming
> >> > > > SQL
> >> > > > > > spec between our three communities (well, technically I
> >> volunteered
> >> > > to
> >> > > > do
> >> > > > > > that w/ Beam and Calcite, but I figured you Flink folks might
> >> want
> >> > to
> >> > > > > join
> >> > > > > > in since you're going that direction already anyway and will
> >> have
> >> > > > useful
> >> > > > > > insights :-).
> >> > > > > >
> >> > > > > > My plan was to do this by sharing two docs:
> >> > > > > >
> >> > > > > >    1. The Beam Model : Streams & Tables - This one is for
> >> context,
> >> > > and
> >> > > > > >    really only mentions SQL in passing. But it describes the
> >> > > > relationship
> >> > > > > >    between the Beam Model and the "streams & tables" way of
> >> > thinking,
> >> > > > > which
> >> > > > > >    turns out to be useful in understanding what robust
> >> streaming in
> >> > > SQL
> >> > > > > > might
> >> > > > > >    look like. Many of you probably already know some or all of
> >> > what's
> >> > > > in
> >> > > > > > here,
> >> > > > > >    but I felt it was necessary to have it all written down in
> >> order
> >> > > to
> >> > > > > > justify
> >> > > > > >    some of the proposals I wanted to make in the second doc.
> >> > > > > >
> >> > > > > >    2. A streaming SQL spec for Calcite - The goal for this doc
> >> is
> >> > > that
> >> > > > it
> >> > > > > >    would become a general specification for what robust
> >> streaming
> >> > SQL
> >> > > > in
> >> > > > > >    Calcite should look like. It would start out as a basic
> >> proposal
> >> > > of
> >> > > > > what
> >> > > > > >    things *could* look like (combining both what things look
> >> like
> >> > now
> >> > > > as
> >> > > > > > well
> >> > > > > >    as a set of proposed changes for the future), and we could
> >> all
> >> > > > iterate
> >> > > > > > on
> >> > > > > >    it together until we get to something we're happy with.
> >> > > > > >
> >> > > > > > At this point, I have doc #1 ready, and it's a bit of a
> monster,
> >> > so I
> >> > > > > > figured I'd share it and let folks hack at it with comments if
> >> they
> >> > > > have
> >> > > > > > any, while I try to get the second doc ready in the meantime.
> As
> >> > part
> >> > > > of
> >> > > > > > getting doc #2 ready, I'll be starting a separate thread to
> try
> >> to
> >> > > > gather
> >> > > > > > input on what things are already in flight for streaming SQL
> >> across
> >> > > the
> >> > > > > > various communities, to make sure the proposal captures
> >> everything
> >> > > > that's
> >> > > > > > going on as accurately as it can.
> >> > > > > >
> >> > > > > > If you have any questions or comments, I'm interested to hear
> >> them.
> >> > > > > > Otherwise, here's doc #1, "The Beam Model : Streams & Tables":
> >> > > > > >
> >> > > > > >   http://s.apache.org/beam-streams-tables
> >> > > > > >
> >> > > > > > -Tyler
> >> > > > > >
> >> > > > >
> >> > > >
> >> > >
> >> >
> >>
> >
>

Re: Towards a spec for robust streaming SQL, Part 1

Posted by Mingmin Xu <mi...@gmail.com>.
Hi Jesse,

Glad to see lots of people interested in Beam SQL. I agree we should
improve the documentation (I created BEAM-2227
<https://issues.apache.org/jira/browse/BEAM-2227>) so users can understand
how SQL is executed with Beam. IMO, Beam SQL aims to fill the gap between
standard SQL queries and Beam pipelines, running either from a CLI or as a
DSL. I don't think Beam will have its own SQL engine.

Thanks!
Mingmin

On Mon, May 8, 2017 at 4:11 PM, Jesse Anderson <je...@bigdatainstitute.io>
wrote:

> -Other dev lists
>
> I'm just coming off speaking about Beam at GOTO Chicago and QCON Sao Paulo.
> There was a ton of interest in Beam with SQL as a cross-framework way of
> doing SQL.
>
> There's some confusion where people think we're just doing a pass through
> to the framework's SQL engine. We'll have to make sure we're clear on how
> Beam's SQL works in the docs.
>
> Thanks,
>
> Jesse
>
> On Mon, May 8, 2017 at 3:34 PM Tyler Akidau <ta...@google.com.invalid>
> wrote:
>
> > Any thoughts here Fabian? I'm planning to start sending out some more
> > emails towards the end of the week.
> >
> > -Tyler
> >
> >
> > On Wed, Apr 26, 2017 at 8:18 AM Tyler Akidau <ta...@google.com> wrote:
> >
> > > No worries, thanks for the heads up. Good luck wrapping all that stuff
> > up.
> > >
> > > -Tyler
> > >
> > > On Tue, Apr 25, 2017 at 12:07 AM Fabian Hueske <fh...@gmail.com>
> > wrote:
> > >
> > >> Hi Tyler,
> > >>
> > >> thanks for pushing this effort and including the Flink list.
> > >> I haven't managed to read the doc yet, but just wanted to thank you
> for
> > >> the
> > >> write-up and let you know that I'm very interested in this discussion.
> > >>
> > >> We are very close to the feature freeze of Flink 1.3 and I'm quite
> busy
> > >> getting as many contributions merged before the release is forked off.
> > >> When that happened, I'll have more time to read and comment.
> > >>
> > >> Thanks,
> > >> Fabian
> > >>
> > >>
> > >> 2017-04-22 0:16 GMT+02:00 Tyler Akidau <ta...@google.com.invalid>:
> > >>
> > >> > Good point, when you start talking about anything less than a full
> > join,
> > >> > triggers get involved to describe how one actually achieves the
> > desired
> > >> > semantics, and they may end up being tied to just one of the inputs
> > >> (e.g.,
> > >> > you may only care about the watermark for one side of the join). Am
> > >> > expecting us to address these sorts of details more precisely in doc
> > #2.
> > >> >
> > >> > -Tyler
> > >> >
> > >> > On Fri, Apr 21, 2017 at 2:26 PM Kenneth Knowles
> > <klk@google.com.invalid
> > >> >
> > >> > wrote:
> > >> >
> > >> > > There's something to be said about having different triggering
> > >> depending
> > >> > on
> > >> > > which side of a join data comes from, perhaps?
> > >> > >
> > >> > > (delightful doc, as usual)
> > >> > >
> > >> > > Kenn
> > >> > >
> > >> > > On Fri, Apr 21, 2017 at 1:33 PM, Tyler Akidau
> > >> <takidau@google.com.invalid
> > >> > >
> > >> > > wrote:
> > >> > >
> > >> > > > Thanks for reading, Luke. The simple answer is that CoGBK is
> > >> basically
> > >> > > > flatten + GBK. Flatten is a non-grouping operation that merges
> the
> > >> > input
> > >> > > > streams into a single output stream. GBK then groups the data
> > within
> > >> > that
> > >> > > > single union stream as you might otherwise expect, yielding a
> > single
> > >> > > table.
> > >> > > > So I think it doesn't really impact things much. Grouping,
> > >> aggregation,
> > >> > > > window merging etc all just act upon the merged stream and
> > generate
> > >> > what
> > >> > > is
> > >> > > > effectively a merged table.
> > >> > > >
> > >> > > > -Tyler
> > >> > > >
> > >> > > > On Fri, Apr 21, 2017 at 12:36 PM Lukasz Cwik
> > >> <lcwik@google.com.invalid
> > >> > >
> > >> > > > wrote:
> > >> > > >
> > >> > > > > The doc is a good read.
> > >> > > > >
> > >> > > > > I think you do a great job of explaining table -> stream,
> stream
> > >> ->
> > >> > > > stream,
> > >> > > > > and stream -> table when there is only one stream.
> > >> > > > > But when there are multiple streams reading/writing to a
> table,
> > >> how
> > >> > > does
> > >> > > > > that impact what occurs?
> > >> > > > > For example, with CoGBK you have multiple streams writing to a
> > >> table,
> > >> > > how
> > >> > > > > does that impact window merging?
> > >> > > > >
> > >> > > > > On Thu, Apr 20, 2017 at 5:57 PM, Tyler Akidau
> > >> > > <takidau@google.com.invalid
> > >> > > > >
> > >> > > > > wrote:
> > >> > > > >
> > >> > > > > > Hello Beam, Calcite, and Flink dev lists!
> > >> > > > > >
> > >> > > > > > Apologies for the big cross post, but I thought this might
> be
> > >> > > something
> > >> > > > > all
> > >> > > > > > three communities would find relevant.
> > >> > > > > >
> > >> > > > > > Beam is finally making progress on a SQL DSL utilizing
> > Calcite,
> > >> > > thanks
> > >> > > > to
> > >> > > > > > Mingmin Xu. As you can imagine, we need to come to some
> > >> conclusion
> > >> > > > about
> > >> > > > > > how to elegantly support the full suite of streaming
> > >> functionality
> > >> > in
> > >> > > > the
> > >> > > > > > Beam model in via Calcite SQL. You folks in the Flink
> > community
> > >> > have
> > >> > > > been
> > >> > > > > > pushing on this (e.g., adding windowing constructs, amongst
> > >> others,
> > >> > > > thank
> > >> > > > > > you! :-), but from my understanding we still don't have a
> full
> > >> spec
> > >> > > for
> > >> > > > > how
> > >> > > > > > to support robust streaming in SQL (including but not
> limited
> > >> to,
> > >> > > > e.g., a
> > >> > > > > > triggers analogue such as EMIT).
> > >> > > > > >
> > >> > > > > > I've been spending a lot of time thinking about this and
> have
> > >> some
> > >> > > > > opinions
> > >> > > > > > about how I think it should look that I've already written
> > down,
> > >> > so I
> > >> > > > > > volunteered to try to drive forward agreement on a general
> > >> > streaming
> > >> > > > SQL
> > >> > > > > > spec between our three communities (well, technically I
> > >> volunteered
> > >> > > to
> > >> > > > do
> > >> > > > > > that w/ Beam and Calcite, but I figured you Flink folks
> might
> > >> want
> > >> > to
> > >> > > > > join
> > >> > > > > > in since you're going that direction already anyway and will
> > >> have
> > >> > > > useful
> > >> > > > > > insights :-).
> > >> > > > > >
> > >> > > > > > My plan was to do this by sharing two docs:
> > >> > > > > >
> > >> > > > > >    1. The Beam Model : Streams & Tables - This one is for
> > >> context,
> > >> > > and
> > >> > > > > >    really only mentions SQL in passing. But it describes the
> > >> > > > relationship
> > >> > > > > >    between the Beam Model and the "streams & tables" way of
> > >> > thinking,
> > >> > > > > which
> > >> > > > > >    turns out to be useful in understanding what robust
> > >> streaming in
> > >> > > SQL
> > >> > > > > > might
> > >> > > > > >    look like. Many of you probably already know some or all
> of
> > >> > what's
> > >> > > > in
> > >> > > > > > here,
> > >> > > > > >    but I felt it was necessary to have it all written down
> in
> > >> order
> > >> > > to
> > >> > > > > > justify
> > >> > > > > >    some of the proposals I wanted to make in the second doc.
> > >> > > > > >
> > >> > > > > >    2. A streaming SQL spec for Calcite - The goal for this
> doc
> > >> is
> > >> > > that
> > >> > > > it
> > >> > > > > >    would become a general specification for what robust
> > >> streaming
> > >> > SQL
> > >> > > > in
> > >> > > > > >    Calcite should look like. It would start out as a basic
> > >> proposal
> > >> > > of
> > >> > > > > what
> > >> > > > > >    things *could* look like (combining both what things look
> > >> like
> > >> > now
> > >> > > > as
> > >> > > > > > well
> > >> > > > > >    as a set of proposed changes for the future), and we
> could
> > >> all
> > >> > > > iterate
> > >> > > > > > on
> > >> > > > > >    it together until we get to something we're happy with.
> > >> > > > > >
> > >> > > > > > At this point, I have doc #1 ready, and it's a bit of a
> > monster,
> > >> > so I
> > >> > > > > > figured I'd share it and let folks hack at it with comments
> if
> > >> they
> > >> > > > have
> > >> > > > > > any, while I try to get the second doc ready in the
> meantime.
> > As
> > >> > part
> > >> > > > of
> > >> > > > > > getting doc #2 ready, I'll be starting a separate thread to
> > try
> > >> to
> > >> > > > gather
> > >> > > > > > input on what things are already in flight for streaming SQL
> > >> across
> > >> > > the
> > >> > > > > > various communities, to make sure the proposal captures
> > >> everything
> > >> > > > that's
> > >> > > > > > going on as accurately as it can.
> > >> > > > > >
> > >> > > > > > If you have any questions or comments, I'm interested to
> hear
> > >> them.
> > >> > > > > > Otherwise, here's doc #1, "The Beam Model : Streams &
> Tables":
> > >> > > > > >
> > >> > > > > >   http://s.apache.org/beam-streams-tables
> > >> > > > > >
> > >> > > > > > -Tyler
> > >> > > > > >
> > >> > > > >
> > >> > > >
> > >> > >
> > >> >
> > >>
> > >
> >
> --
> Thanks,
>
> Jesse
>



-- 
----
Mingmin

Re: Towards a spec for robust streaming SQL, Part 1

Posted by Jesse Anderson <je...@bigdatainstitute.io>.
-Other dev lists

I'm just coming off speaking about Beam at GOTO Chicago and QCON Sao Paulo.
There was a ton of interest in Beam with SQL as a cross-framework way of
doing SQL.

There's some confusion where people think we're just doing a pass through
to the framework's SQL engine. We'll have to make sure we're clear on how
Beam's SQL works in the docs.

Thanks,

Jesse

On Mon, May 8, 2017 at 3:34 PM Tyler Akidau <ta...@google.com.invalid>
wrote:

> Any thoughts here Fabian? I'm planning to start sending out some more
> emails towards the end of the week.
>
> -Tyler
>
>
> On Wed, Apr 26, 2017 at 8:18 AM Tyler Akidau <ta...@google.com> wrote:
>
> > No worries, thanks for the heads up. Good luck wrapping all that stuff
> up.
> >
> > -Tyler
> >
> > On Tue, Apr 25, 2017 at 12:07 AM Fabian Hueske <fh...@gmail.com>
> wrote:
> >
> >> Hi Tyler,
> >>
> >> thanks for pushing this effort and including the Flink list.
> >> I haven't managed to read the doc yet, but just wanted to thank you for
> >> the
> >> write-up and let you know that I'm very interested in this discussion.
> >>
> >> We are very close to the feature freeze of Flink 1.3 and I'm quite busy
> >> getting as many contributions merged before the release is forked off.
> >> When that happened, I'll have more time to read and comment.
> >>
> >> Thanks,
> >> Fabian
> >>
> >>
> >> 2017-04-22 0:16 GMT+02:00 Tyler Akidau <ta...@google.com.invalid>:
> >>
> >> > Good point, when you start talking about anything less than a full
> join,
> >> > triggers get involved to describe how one actually achieves the
> desired
> >> > semantics, and they may end up being tied to just one of the inputs
> >> (e.g.,
> >> > you may only care about the watermark for one side of the join). Am
> >> > expecting us to address these sorts of details more precisely in doc
> #2.
> >> >
> >> > -Tyler
> >> >
> >> > On Fri, Apr 21, 2017 at 2:26 PM Kenneth Knowles
> <klk@google.com.invalid
> >> >
> >> > wrote:
> >> >
> >> > > There's something to be said about having different triggering
> >> depending
> >> > on
> >> > > which side of a join data comes from, perhaps?
> >> > >
> >> > > (delightful doc, as usual)
> >> > >
> >> > > Kenn
> >> > >
> >> > > On Fri, Apr 21, 2017 at 1:33 PM, Tyler Akidau
> >> <takidau@google.com.invalid
> >> > >
> >> > > wrote:
> >> > >
> >> > > > Thanks for reading, Luke. The simple answer is that CoGBK is
> >> basically
> >> > > > flatten + GBK. Flatten is a non-grouping operation that merges the
> >> > input
> >> > > > streams into a single output stream. GBK then groups the data
> within
> >> > that
> >> > > > single union stream as you might otherwise expect, yielding a
> single
> >> > > table.
> >> > > > So I think it doesn't really impact things much. Grouping,
> >> aggregation,
> >> > > > window merging etc all just act upon the merged stream and
> generate
> >> > what
> >> > > is
> >> > > > effectively a merged table.
> >> > > >
> >> > > > -Tyler
> >> > > >
> >> > > > On Fri, Apr 21, 2017 at 12:36 PM Lukasz Cwik
> >> <lcwik@google.com.invalid
> >> > >
> >> > > > wrote:
> >> > > >
> >> > > > > The doc is a good read.
> >> > > > >
> >> > > > > I think you do a great job of explaining table -> stream, stream
> >> ->
> >> > > > stream,
> >> > > > > and stream -> table when there is only one stream.
> >> > > > > But when there are multiple streams reading/writing to a table,
> >> how
> >> > > does
> >> > > > > that impact what occurs?
> >> > > > > For example, with CoGBK you have multiple streams writing to a
> >> table,
> >> > > how
> >> > > > > does that impact window merging?
> >> > > > >
> >> > > > > On Thu, Apr 20, 2017 at 5:57 PM, Tyler Akidau
> >> > > <takidau@google.com.invalid
> >> > > > >
> >> > > > > wrote:
> >> > > > >
> >> > > > > > Hello Beam, Calcite, and Flink dev lists!
> >> > > > > >
> >> > > > > > Apologies for the big cross post, but I thought this might be
> >> > > something
> >> > > > > all
> >> > > > > > three communities would find relevant.
> >> > > > > >
> >> > > > > > Beam is finally making progress on a SQL DSL utilizing
> Calcite,
> >> > > thanks
> >> > > > to
> >> > > > > > Mingmin Xu. As you can imagine, we need to come to some
> >> conclusion
> >> > > > about
> >> > > > > > how to elegantly support the full suite of streaming
> >> functionality
> >> > in
> >> > > > the
> >> > > > > > Beam model in via Calcite SQL. You folks in the Flink
> community
> >> > have
> >> > > > been
> >> > > > > > pushing on this (e.g., adding windowing constructs, amongst
> >> others,
> >> > > > thank
> >> > > > > > you! :-), but from my understanding we still don't have a full
> >> spec
> >> > > for
> >> > > > > how
> >> > > > > > to support robust streaming in SQL (including but not limited
> >> to,
> >> > > > e.g., a
> >> > > > > > triggers analogue such as EMIT).
> >> > > > > >
> >> > > > > > I've been spending a lot of time thinking about this and have
> >> some
> >> > > > > opinions
> >> > > > > > about how I think it should look that I've already written
> down,
> >> > so I
> >> > > > > > volunteered to try to drive forward agreement on a general
> >> > streaming
> >> > > > SQL
> >> > > > > > spec between our three communities (well, technically I
> >> volunteered
> >> > > to
> >> > > > do
> >> > > > > > that w/ Beam and Calcite, but I figured you Flink folks might
> >> want
> >> > to
> >> > > > > join
> >> > > > > > in since you're going that direction already anyway and will
> >> have
> >> > > > useful
> >> > > > > > insights :-).
> >> > > > > >
> >> > > > > > My plan was to do this by sharing two docs:
> >> > > > > >
> >> > > > > >    1. The Beam Model : Streams & Tables - This one is for
> >> context,
> >> > > and
> >> > > > > >    really only mentions SQL in passing. But it describes the
> >> > > > relationship
> >> > > > > >    between the Beam Model and the "streams & tables" way of
> >> > thinking,
> >> > > > > which
> >> > > > > >    turns out to be useful in understanding what robust
> >> streaming in
> >> > > SQL
> >> > > > > > might
> >> > > > > >    look like. Many of you probably already know some or all of
> >> > what's
> >> > > > in
> >> > > > > > here,
> >> > > > > >    but I felt it was necessary to have it all written down in
> >> order
> >> > > to
> >> > > > > > justify
> >> > > > > >    some of the proposals I wanted to make in the second doc.
> >> > > > > >
> >> > > > > >    2. A streaming SQL spec for Calcite - The goal for this doc
> >> is
> >> > > that
> >> > > > it
> >> > > > > >    would become a general specification for what robust
> >> streaming
> >> > SQL
> >> > > > in
> >> > > > > >    Calcite should look like. It would start out as a basic
> >> proposal
> >> > > of
> >> > > > > what
> >> > > > > >    things *could* look like (combining both what things look
> >> like
> >> > now
> >> > > > as
> >> > > > > > well
> >> > > > > >    as a set of proposed changes for the future), and we could
> >> all
> >> > > > iterate
> >> > > > > > on
> >> > > > > >    it together until we get to something we're happy with.
> >> > > > > >
> >> > > > > > At this point, I have doc #1 ready, and it's a bit of a
> monster,
> >> > so I
> >> > > > > > figured I'd share it and let folks hack at it with comments if
> >> they
> >> > > > have
> >> > > > > > any, while I try to get the second doc ready in the meantime.
> As
> >> > part
> >> > > > of
> >> > > > > > getting doc #2 ready, I'll be starting a separate thread to
> try
> >> to
> >> > > > gather
> >> > > > > > input on what things are already in flight for streaming SQL
> >> across
> >> > > the
> >> > > > > > various communities, to make sure the proposal captures
> >> everything
> >> > > > that's
> >> > > > > > going on as accurately as it can.
> >> > > > > >
> >> > > > > > If you have any questions or comments, I'm interested to hear
> >> them.
> >> > > > > > Otherwise, here's doc #1, "The Beam Model : Streams & Tables":
> >> > > > > >
> >> > > > > >   http://s.apache.org/beam-streams-tables
> >> > > > > >
> >> > > > > > -Tyler
> >> > > > > >
> >> > > > >
> >> > > >
> >> > >
> >> >
> >>
> >
>
-- 
Thanks,

Jesse
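
The stream/table duality at the heart of doc #1 can be illustrated with a toy sketch. This is plain Python, not Beam SDK code, and every name in it is illustrative: a table is what you get by folding a changelog stream into state, and a stream is the changelog you get by diffing two table snapshots.

```python
def apply_changelog(changes):
    # stream -> table: fold a changelog of (key, delta) records into state.
    table = {}
    for key, delta in changes:
        table[key] = table.get(key, 0) + delta
    return table

def changelog(old, new):
    # table -> stream: emit the deltas that turn table `old` into table `new`.
    for key in set(old) | set(new):
        delta = new.get(key, 0) - old.get(key, 0)
        if delta:
            yield key, delta

stream = [("a", 1), ("b", 2), ("a", 3)]
table = apply_changelog(stream)  # {'a': 4, 'b': 2}
```

Note that `apply_changelog(changelog(old, new))` applied on top of `old` reproduces `new`, which is the round-trip property the duality argument relies on.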

Re: Towards a spec for robust streaming SQL, Part 1

Posted by Tyler Akidau <ta...@google.com.INVALID>.
Any thoughts here Fabian? I'm planning to start sending out some more
emails towards the end of the week.

-Tyler


On Wed, Apr 26, 2017 at 8:18 AM Tyler Akidau <ta...@google.com> wrote:

> No worries, thanks for the heads up. Good luck wrapping all that stuff up.
>
> -Tyler
>
> On Tue, Apr 25, 2017 at 12:07 AM Fabian Hueske <fh...@gmail.com> wrote:
>
>> Hi Tyler,
>>
>> thanks for pushing this effort and including the Flink list.
>> I haven't managed to read the doc yet, but just wanted to thank you for
>> the
>> write-up and let you know that I'm very interested in this discussion.
>>
>> We are very close to the feature freeze of Flink 1.3 and I'm quite busy
>> getting as many contributions merged before the release is forked off.
>> Once that happens, I'll have more time to read and comment.
>>
>> Thanks,
>> Fabian
>>
>>
>> 2017-04-22 0:16 GMT+02:00 Tyler Akidau <ta...@google.com.invalid>:
>>
>> > Good point, when you start talking about anything less than a full join,
>> > triggers get involved to describe how one actually achieves the desired
>> > semantics, and they may end up being tied to just one of the inputs
>> (e.g.,
>> > you may only care about the watermark for one side of the join). Am
>> > expecting us to address these sorts of details more precisely in doc #2.
>> >
>> > -Tyler
>> >
>> > On Fri, Apr 21, 2017 at 2:26 PM Kenneth Knowles <klk@google.com.invalid
>> >
>> > wrote:
>> >
>> > > There's something to be said about having different triggering
>> depending
>> > on
>> > > which side of a join data comes from, perhaps?
>> > >
>> > > (delightful doc, as usual)
>> > >
>> > > Kenn
>> > >
>> > > On Fri, Apr 21, 2017 at 1:33 PM, Tyler Akidau
>> <takidau@google.com.invalid
>> > >
>> > > wrote:
>> > >
>> > > > Thanks for reading, Luke. The simple answer is that CoGBK is
>> basically
>> > > > flatten + GBK. Flatten is a non-grouping operation that merges the
>> > input
>> > > > streams into a single output stream. GBK then groups the data within
>> > that
>> > > > single union stream as you might otherwise expect, yielding a single
>> > > table.
>> > > > So I think it doesn't really impact things much. Grouping,
>> aggregation,
>> > > > window merging etc all just act upon the merged stream and generate
>> > what
>> > > is
>> > > > effectively a merged table.
>> > > >
>> > > > -Tyler
>> > > >
>> > > > On Fri, Apr 21, 2017 at 12:36 PM Lukasz Cwik
>> <lcwik@google.com.invalid
>> > >
>> > > > wrote:
>> > > >
>> > > > > The doc is a good read.
>> > > > >
>> > > > > I think you do a great job of explaining table -> stream, stream
>> ->
>> > > > stream,
>> > > > > and stream -> table when there is only one stream.
>> > > > > But when there are multiple streams reading/writing to a table,
>> how
>> > > does
>> > > > > that impact what occurs?
>> > > > > For example, with CoGBK you have multiple streams writing to a
>> table,
>> > > how
>> > > > > does that impact window merging?
>> > > > >
>> > > > > On Thu, Apr 20, 2017 at 5:57 PM, Tyler Akidau
>> > > <takidau@google.com.invalid
>> > > > >
>> > > > > wrote:
>> > > > >
>> > > > > > Hello Beam, Calcite, and Flink dev lists!
>> > > > > >
>> > > > > > Apologies for the big cross post, but I thought this might be
>> > > something
>> > > > > all
>> > > > > > three communities would find relevant.
>> > > > > >
>> > > > > > Beam is finally making progress on a SQL DSL utilizing Calcite,
>> > > thanks
>> > > > to
>> > > > > > Mingmin Xu. As you can imagine, we need to come to some
>> conclusion
>> > > > about
>> > > > > > how to elegantly support the full suite of streaming
>> functionality
>> > in
>> > > > the
>> > > > > > Beam model via Calcite SQL. You folks in the Flink community
>> > have
>> > > > been
>> > > > > > pushing on this (e.g., adding windowing constructs, amongst
>> others,
>> > > > thank
>> > > > > > you! :-), but from my understanding we still don't have a full
>> spec
>> > > for
>> > > > > how
>> > > > > > to support robust streaming in SQL (including but not limited
>> to,
>> > > > e.g., a
>> > > > > > triggers analogue such as EMIT).
>> > > > > >
>> > > > > > I've been spending a lot of time thinking about this and have
>> some
>> > > > > opinions
>> > > > > > about how I think it should look that I've already written down,
>> > so I
>> > > > > > volunteered to try to drive forward agreement on a general
>> > streaming
>> > > > SQL
>> > > > > > spec between our three communities (well, technically I
>> volunteered
>> > > to
>> > > > do
>> > > > > > that w/ Beam and Calcite, but I figured you Flink folks might
>> want
>> > to
>> > > > > join
>> > > > > > in since you're going that direction already anyway and will
>> have
>> > > > useful
>> > > > > > insights :-).
>> > > > > >
>> > > > > > My plan was to do this by sharing two docs:
>> > > > > >
>> > > > > >    1. The Beam Model : Streams & Tables - This one is for
>> context,
>> > > and
>> > > > > >    really only mentions SQL in passing. But it describes the
>> > > > relationship
>> > > > > >    between the Beam Model and the "streams & tables" way of
>> > thinking,
>> > > > > which
>> > > > > >    turns out to be useful in understanding what robust
>> streaming in
>> > > SQL
>> > > > > > might
>> > > > > >    look like. Many of you probably already know some or all of
>> > what's
>> > > > in
>> > > > > > here,
>> > > > > >    but I felt it was necessary to have it all written down in
>> order
>> > > to
>> > > > > > justify
>> > > > > >    some of the proposals I wanted to make in the second doc.
>> > > > > >
>> > > > > >    2. A streaming SQL spec for Calcite - The goal for this doc
>> is
>> > > that
>> > > > it
>> > > > > >    would become a general specification for what robust
>> streaming
>> > SQL
>> > > > in
>> > > > > >    Calcite should look like. It would start out as a basic
>> proposal
>> > > of
>> > > > > what
>> > > > > >    things *could* look like (combining both what things look
>> like
>> > now
>> > > > as
>> > > > > > well
>> > > > > >    as a set of proposed changes for the future), and we could
>> all
>> > > > iterate
>> > > > > > on
>> > > > > >    it together until we get to something we're happy with.
>> > > > > >
>> > > > > > At this point, I have doc #1 ready, and it's a bit of a monster,
>> > so I
>> > > > > > figured I'd share it and let folks hack at it with comments if
>> they
>> > > > have
>> > > > > > any, while I try to get the second doc ready in the meantime. As
>> > part
>> > > > of
>> > > > > > getting doc #2 ready, I'll be starting a separate thread to try
>> to
>> > > > gather
>> > > > > > input on what things are already in flight for streaming SQL
>> across
>> > > the
>> > > > > > various communities, to make sure the proposal captures
>> everything
>> > > > that's
>> > > > > > going on as accurately as it can.
>> > > > > >
>> > > > > > If you have any questions or comments, I'm interested to hear
>> them.
>> > > > > > Otherwise, here's doc #1, "The Beam Model : Streams & Tables":
>> > > > > >
>> > > > > >   http://s.apache.org/beam-streams-tables
>> > > > > >
>> > > > > > -Tyler
>> > > > > >
>> > > > >
>> > > >
>> > >
>> >
>>
>
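
Tyler's "CoGBK is basically flatten + GBK" answer above can be modeled in a few lines. This is a toy Python sketch of the semantics, not Beam SDK code; the names and the source-index tagging scheme are assumptions made for illustration only.

```python
from collections import defaultdict

def flatten(*streams):
    # Flatten: a non-grouping merge of several streams into one,
    # tagging each element with the index of its source stream.
    for i, stream in enumerate(streams):
        for key, value in stream:
            yield key, (i, value)

def group_by_key(stream, n_sources):
    # GBK: group the merged stream per key, then split the grouped
    # values back out by source tag, yielding the CoGBK result.
    table = defaultdict(lambda: [[] for _ in range(n_sources)])
    for key, (tag, value) in stream:
        table[key][tag].append(value)
    return dict(table)

def co_gbk(*streams):
    # CoGBK modeled as Flatten followed by GBK, per the explanation above.
    return group_by_key(flatten(*streams), len(streams))

emails = [("alice", "a@x"), ("bob", "b@x")]
phones = [("alice", "555-1"), ("alice", "555-2")]
print(co_gbk(emails, phones))
# {'alice': [['a@x'], ['555-1', '555-2']], 'bob': [['b@x'], []]}
```

The point of the model: grouping, aggregation, and merging all act on the single flattened stream, producing one merged table, which is why multiple input streams don't change the stream/table analysis.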

Re: Towards a spec for robust streaming SQL, Part 1

Posted by Tyler Akidau <ta...@google.com.INVALID>.
No worries, thanks for the heads up. Good luck wrapping all that stuff up.

-Tyler

On Tue, Apr 25, 2017 at 12:07 AM Fabian Hueske <fh...@gmail.com> wrote:

> Hi Tyler,
>
> thanks for pushing this effort and including the Flink list.
> I haven't managed to read the doc yet, but just wanted to thank you for the
> write-up and let you know that I'm very interested in this discussion.
>
> We are very close to the feature freeze of Flink 1.3 and I'm quite busy
> getting as many contributions merged before the release is forked off.
> Once that happens, I'll have more time to read and comment.
>
> Thanks,
> Fabian
>
>
> 2017-04-22 0:16 GMT+02:00 Tyler Akidau <ta...@google.com.invalid>:
>
> > Good point, when you start talking about anything less than a full join,
> > triggers get involved to describe how one actually achieves the desired
> > semantics, and they may end up being tied to just one of the inputs
> (e.g.,
> > you may only care about the watermark for one side of the join). Am
> > expecting us to address these sorts of details more precisely in doc #2.
> >
> > -Tyler
> >
> > On Fri, Apr 21, 2017 at 2:26 PM Kenneth Knowles <kl...@google.com.invalid>
> > wrote:
> >
> > > There's something to be said about having different triggering
> depending
> > on
> > > which side of a join data comes from, perhaps?
> > >
> > > (delightful doc, as usual)
> > >
> > > Kenn
> > >
> > > On Fri, Apr 21, 2017 at 1:33 PM, Tyler Akidau
> <takidau@google.com.invalid
> > >
> > > wrote:
> > >
> > > > Thanks for reading, Luke. The simple answer is that CoGBK is
> basically
> > > > flatten + GBK. Flatten is a non-grouping operation that merges the
> > input
> > > > streams into a single output stream. GBK then groups the data within
> > that
> > > > single union stream as you might otherwise expect, yielding a single
> > > table.
> > > > So I think it doesn't really impact things much. Grouping,
> aggregation,
> > > > window merging etc all just act upon the merged stream and generate
> > what
> > > is
> > > > effectively a merged table.
> > > >
> > > > -Tyler
> > > >
> > > > On Fri, Apr 21, 2017 at 12:36 PM Lukasz Cwik
> <lcwik@google.com.invalid
> > >
> > > > wrote:
> > > >
> > > > > The doc is a good read.
> > > > >
> > > > > I think you do a great job of explaining table -> stream, stream ->
> > > > stream,
> > > > > and stream -> table when there is only one stream.
> > > > > But when there are multiple streams reading/writing to a table, how
> > > does
> > > > > that impact what occurs?
> > > > > For example, with CoGBK you have multiple streams writing to a
> table,
> > > how
> > > > > does that impact window merging?
> > > > >
> > > > > On Thu, Apr 20, 2017 at 5:57 PM, Tyler Akidau
> > > <takidau@google.com.invalid
> > > > >
> > > > > wrote:
> > > > >
> > > > > > Hello Beam, Calcite, and Flink dev lists!
> > > > > >
> > > > > > Apologies for the big cross post, but I thought this might be
> > > something
> > > > > all
> > > > > > three communities would find relevant.
> > > > > >
> > > > > > Beam is finally making progress on a SQL DSL utilizing Calcite,
> > > thanks
> > > > to
> > > > > > Mingmin Xu. As you can imagine, we need to come to some
> conclusion
> > > > about
> > > > > > how to elegantly support the full suite of streaming
> functionality
> > in
> > > > the
> > > > > > Beam model via Calcite SQL. You folks in the Flink community
> > have
> > > > been
> > > > > > pushing on this (e.g., adding windowing constructs, amongst
> others,
> > > > thank
> > > > > > you! :-), but from my understanding we still don't have a full
> spec
> > > for
> > > > > how
> > > > > > to support robust streaming in SQL (including but not limited to,
> > > > e.g., a
> > > > > > triggers analogue such as EMIT).
> > > > > >
> > > > > > I've been spending a lot of time thinking about this and have
> some
> > > > > opinions
> > > > > > about how I think it should look that I've already written down,
> > so I
> > > > > > volunteered to try to drive forward agreement on a general
> > streaming
> > > > SQL
> > > > > > spec between our three communities (well, technically I
> volunteered
> > > to
> > > > do
> > > > > > that w/ Beam and Calcite, but I figured you Flink folks might
> want
> > to
> > > > > join
> > > > > > in since you're going that direction already anyway and will have
> > > > useful
> > > > > > insights :-).
> > > > > >
> > > > > > My plan was to do this by sharing two docs:
> > > > > >
> > > > > >    1. The Beam Model : Streams & Tables - This one is for
> context,
> > > and
> > > > > >    really only mentions SQL in passing. But it describes the
> > > > relationship
> > > > > >    between the Beam Model and the "streams & tables" way of
> > thinking,
> > > > > which
> > > > > >    turns out to be useful in understanding what robust streaming
> in
> > > SQL
> > > > > > might
> > > > > >    look like. Many of you probably already know some or all of
> > what's
> > > > in
> > > > > > here,
> > > > > >    but I felt it was necessary to have it all written down in
> order
> > > to
> > > > > > justify
> > > > > >    some of the proposals I wanted to make in the second doc.
> > > > > >
> > > > > >    2. A streaming SQL spec for Calcite - The goal for this doc is
> > > that
> > > > it
> > > > > >    would become a general specification for what robust streaming
> > SQL
> > > > in
> > > > > >    Calcite should look like. It would start out as a basic
> proposal
> > > of
> > > > > what
> > > > > >    things *could* look like (combining both what things look like
> > now
> > > > as
> > > > > > well
> > > > > >    as a set of proposed changes for the future), and we could all
> > > > iterate
> > > > > > on
> > > > > >    it together until we get to something we're happy with.
> > > > > >
> > > > > > At this point, I have doc #1 ready, and it's a bit of a monster,
> > so I
> > > > > > figured I'd share it and let folks hack at it with comments if
> they
> > > > have
> > > > > > any, while I try to get the second doc ready in the meantime. As
> > part
> > > > of
> > > > > > getting doc #2 ready, I'll be starting a separate thread to try
> to
> > > > gather
> > > > > > input on what things are already in flight for streaming SQL
> across
> > > the
> > > > > > various communities, to make sure the proposal captures
> everything
> > > > that's
> > > > > > going on as accurately as it can.
> > > > > >
> > > > > > If you have any questions or comments, I'm interested to hear
> them.
> > > > > > Otherwise, here's doc #1, "The Beam Model : Streams & Tables":
> > > > > >
> > > > > >   http://s.apache.org/beam-streams-tables
> > > > > >
> > > > > > -Tyler
> > > > > >
> > > > >
> > > >
> > >
> >
>

> > > > >
> > > > > > Hello Beam, Calcite, and Flink dev lists!
> > > > > >
> > > > > > Apologies for the big cross post, but I thought this might be
> > > something
> > > > > all
> > > > > > three communities would find relevant.
> > > > > >
> > > > > > Beam is finally making progress on a SQL DSL utilizing Calcite,
> > > thanks
> > > > to
> > > > > > Mingmin Xu. As you can imagine, we need to come to some
> conclusion
> > > > about
> > > > > > how to elegantly support the full suite of streaming
> functionality
> > in
> > > > the
> > > > > > Beam model via Calcite SQL. You folks in the Flink community
> > have
> > > > been
> > > > > > pushing on this (e.g., adding windowing constructs, amongst
> others,
> > > > thank
> > > > > > you! :-), but from my understanding we still don't have a full
> spec
> > > for
> > > > > how
> > > > > > to support robust streaming in SQL (including but not limited to,
> > > > e.g., a
> > > > > > triggers analogue such as EMIT).
> > > > > >
> > > > > > I've been spending a lot of time thinking about this and have
> some
> > > > > opinions
> > > > > > about how I think it should look that I've already written down,
> > so I
> > > > > > volunteered to try to drive forward agreement on a general
> > streaming
> > > > SQL
> > > > > > spec between our three communities (well, technically I
> volunteered
> > > to
> > > > do
> > > > > > that w/ Beam and Calcite, but I figured you Flink folks might
> want
> > to
> > > > > join
> > > > > > in since you're going that direction already anyway and will have
> > > > useful
> > > > > > insights :-).
> > > > > >
> > > > > > My plan was to do this by sharing two docs:
> > > > > >
> > > > > >    1. The Beam Model : Streams & Tables - This one is for
> context,
> > > and
> > > > > >    really only mentions SQL in passing. But it describes the
> > > > relationship
> > > > > >    between the Beam Model and the "streams & tables" way of
> > thinking,
> > > > > which
> > > > > >    turns out to be useful in understanding what robust streaming
> in
> > > SQL
> > > > > > might
> > > > > >    look like. Many of you probably already know some or all of
> > what's
> > > > in
> > > > > > here,
> > > > > >    but I felt it was necessary to have it all written down in
> order
> > > to
> > > > > > justify
> > > > > >    some of the proposals I wanted to make in the second doc.
> > > > > >
> > > > > >    2. A streaming SQL spec for Calcite - The goal for this doc is
> > > that
> > > > it
> > > > > >    would become a general specification for what robust streaming
> > SQL
> > > > in
> > > > > >    Calcite should look like. It would start out as a basic
> proposal
> > > of
> > > > > what
> > > > > >    things *could* look like (combining both what things look like
> > now
> > > > as
> > > > > > well
> > > > > >    as a set of proposed changes for the future), and we could all
> > > > iterate
> > > > > > on
> > > > > >    it together until we get to something we're happy with.
> > > > > >
> > > > > > At this point, I have doc #1 ready, and it's a bit of a monster,
> > so I
> > > > > > figured I'd share it and let folks hack at it with comments if
> they
> > > > have
> > > > > > any, while I try to get the second doc ready in the meantime. As
> > part
> > > > of
> > > > > > getting doc #2 ready, I'll be starting a separate thread to try
> to
> > > > gather
> > > > > > input on what things are already in flight for streaming SQL
> across
> > > the
> > > > > > various communities, to make sure the proposal captures
> everything
> > > > that's
> > > > > > going on as accurately as it can.
> > > > > >
> > > > > > If you have any questions or comments, I'm interested to hear
> them.
> > > > > > Otherwise, here's doc #1, "The Beam Model : Streams & Tables":
> > > > > >
> > > > > >   http://s.apache.org/beam-streams-tables
> > > > > >
> > > > > > -Tyler
> > > > > >
> > > > >
> > > >
> > >
> >
>
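The CoGBK-as-Flatten-plus-GBK equivalence described in the thread above can be sketched in plain Python (illustrative only; the function names and tuple shapes here are assumptions, not the actual Beam API):

```python
from collections import defaultdict

def flatten(*streams):
    """Merge several keyed input streams into one stream, tagging each
    element with the index of the input it came from (non-grouping)."""
    for tag, stream in enumerate(streams):
        for key, value in stream:
            yield key, (tag, value)

def group_by_key(stream, num_inputs):
    """Group the merged stream per key, separating values back out per
    original input -- effectively the CoGBK result as a single table."""
    table = defaultdict(lambda: [[] for _ in range(num_inputs)])
    for key, (tag, value) in stream:
        table[key][tag].append(value)
    return dict(table)

# CoGBK(clicks, views) behaves like GBK(Flatten(clicks, views)):
clicks = [("user1", "click_a"), ("user2", "click_b")]
views = [("user1", "view_x")]
result = group_by_key(flatten(clicks, views), num_inputs=2)
# result["user1"] -> [["click_a"], ["view_x"]]
```

As the thread notes, grouping acts on the single merged stream, so the output is one table keyed as usual, with per-input value lists per key.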

Re: Towards a spec for robust streaming SQL, Part 1

Posted by Fabian Hueske <fh...@gmail.com>.
Hi Tyler,

thanks for pushing this effort and including the Flink list.
I haven't managed to read the doc yet, but just wanted to thank you for the
write-up and let you know that I'm very interested in this discussion.

We are very close to the feature freeze of Flink 1.3 and I'm quite busy
getting as many contributions merged before the release is forked off.
Once that has happened, I'll have more time to read and comment.

Thanks,
Fabian


2017-04-22 0:16 GMT+02:00 Tyler Akidau <ta...@google.com.invalid>:

> Good point, when you start talking about anything less than a full join,
> triggers get involved to describe how one actually achieves the desired
> semantics, and they may end up being tied to just one of the inputs (e.g.,
> you may only care about the watermark for one side of the join). Am
> expecting us to address these sorts of details more precisely in doc #2.
>
> -Tyler
>
> On Fri, Apr 21, 2017 at 2:26 PM Kenneth Knowles <kl...@google.com.invalid>
> wrote:
>
> > There's something to be said about having different triggering depending
> on
> > which side of a join data comes from, perhaps?
> >
> > (delightful doc, as usual)
> >
> > Kenn
> >
> > On Fri, Apr 21, 2017 at 1:33 PM, Tyler Akidau <takidau@google.com.invalid
> >
> > wrote:
> >
> > > Thanks for reading, Luke. The simple answer is that CoGBK is basically
> > > flatten + GBK. Flatten is a non-grouping operation that merges the
> input
> > > streams into a single output stream. GBK then groups the data within
> that
> > > single union stream as you might otherwise expect, yielding a single
> > table.
> > > So I think it doesn't really impact things much. Grouping, aggregation,
> > > window merging etc all just act upon the merged stream and generate
> what
> > is
> > > effectively a merged table.
> > >
> > > -Tyler
> > >
> > > On Fri, Apr 21, 2017 at 12:36 PM Lukasz Cwik <lcwik@google.com.invalid
> >
> > > wrote:
> > >
> > > > The doc is a good read.
> > > >
> > > > I think you do a great job of explaining table -> stream, stream ->
> > > stream,
> > > > and stream -> table when there is only one stream.
> > > > But when there are multiple streams reading/writing to a table, how
> > does
> > > > that impact what occurs?
> > > > For example, with CoGBK you have multiple streams writing to a table,
> > how
> > > > does that impact window merging?
> > > >
> > > > On Thu, Apr 20, 2017 at 5:57 PM, Tyler Akidau
> > <takidau@google.com.invalid
> > > >
> > > > wrote:
> > > >
> > > > > Hello Beam, Calcite, and Flink dev lists!
> > > > >
> > > > > Apologies for the big cross post, but I thought this might be
> > something
> > > > all
> > > > > three communities would find relevant.
> > > > >
> > > > > Beam is finally making progress on a SQL DSL utilizing Calcite,
> > thanks
> > > to
> > > > > Mingmin Xu. As you can imagine, we need to come to some conclusion
> > > about
> > > > > how to elegantly support the full suite of streaming functionality
> in
> > > the
> > > > > Beam model via Calcite SQL. You folks in the Flink community
> have
> > > been
> > > > > pushing on this (e.g., adding windowing constructs, amongst others,
> > > thank
> > > > > you! :-), but from my understanding we still don't have a full spec
> > for
> > > > how
> > > > > to support robust streaming in SQL (including but not limited to,
> > > e.g., a
> > > > > triggers analogue such as EMIT).
> > > > >
> > > > > I've been spending a lot of time thinking about this and have some
> > > > opinions
> > > > > about how I think it should look that I've already written down,
> so I
> > > > > volunteered to try to drive forward agreement on a general
> streaming
> > > SQL
> > > > > spec between our three communities (well, technically I volunteered
> > to
> > > do
> > > > > that w/ Beam and Calcite, but I figured you Flink folks might want
> to
> > > > join
> > > > > in since you're going that direction already anyway and will have
> > > useful
> > > > > insights :-).
> > > > >
> > > > > My plan was to do this by sharing two docs:
> > > > >
> > > > >    1. The Beam Model : Streams & Tables - This one is for context,
> > and
> > > > >    really only mentions SQL in passing. But it describes the
> > > relationship
> > > > >    between the Beam Model and the "streams & tables" way of
> thinking,
> > > > which
> > > > >    turns out to be useful in understanding what robust streaming in
> > SQL
> > > > > might
> > > > >    look like. Many of you probably already know some or all of
> what's
> > > in
> > > > > here,
> > > > >    but I felt it was necessary to have it all written down in order
> > to
> > > > > justify
> > > > >    some of the proposals I wanted to make in the second doc.
> > > > >
> > > > >    2. A streaming SQL spec for Calcite - The goal for this doc is
> > that
> > > it
> > > > >    would become a general specification for what robust streaming
> SQL
> > > in
> > > > >    Calcite should look like. It would start out as a basic proposal
> > of
> > > > what
> > > > >    things *could* look like (combining both what things look like
> now
> > > as
> > > > > well
> > > > >    as a set of proposed changes for the future), and we could all
> > > iterate
> > > > > on
> > > > >    it together until we get to something we're happy with.
> > > > >
> > > > > At this point, I have doc #1 ready, and it's a bit of a monster,
> so I
> > > > > figured I'd share it and let folks hack at it with comments if they
> > > have
> > > > > any, while I try to get the second doc ready in the meantime. As
> part
> > > of
> > > > > getting doc #2 ready, I'll be starting a separate thread to try to
> > > gather
> > > > > input on what things are already in flight for streaming SQL across
> > the
> > > > > various communities, to make sure the proposal captures everything
> > > that's
> > > > > going on as accurately as it can.
> > > > >
> > > > > If you have any questions or comments, I'm interested to hear them.
> > > > > Otherwise, here's doc #1, "The Beam Model : Streams & Tables":
> > > > >
> > > > >   http://s.apache.org/beam-streams-tables
> > > > >
> > > > > -Tyler
> > > > >
> > > >
> > >
> >
>

Re: Towards a spec for robust streaming SQL, Part 1

Posted by Tyler Akidau <ta...@google.com.INVALID>.
Good point, when you start talking about anything less than a full join,
triggers get involved to describe how one actually achieves the desired
semantics, and they may end up being tied to just one of the inputs (e.g.,
you may only care about the watermark for one side of the join). Am
expecting us to address these sorts of details more precisely in doc #2.

-Tyler
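
The idea of a trigger tied to just one input of a join can be sketched in a few lines of plain Python (a toy model, not Beam or SQL semantics; every name here is an illustrative assumption): buffer both sides, but let only the left side's watermark decide when a key's join result is emitted.

```python
def join_with_one_sided_trigger(left, right, left_watermark):
    """Toy join: elements are (key, timestamp, value). A key's result is
    emitted only once the LEFT watermark has passed that key's timestamp;
    the right side's watermark plays no role in triggering."""
    left_buf, right_buf, output = {}, {}, []
    for key, ts, value in left:
        left_buf[key] = (ts, value)
    for key, ts, value in right:
        right_buf[key] = (ts, value)
    for key, (ts, lval) in left_buf.items():
        if ts <= left_watermark:  # only the left watermark matters here
            rval = right_buf.get(key, (None, None))[1]
            output.append((key, lval, rval))
    return output

left = [("a", 5, "L1"), ("b", 20, "L2")]
right = [("a", 7, "R1"), ("b", 8, "R2")]
out = join_with_one_sided_trigger(left, right, left_watermark=10)
# "b" is held back despite both inputs having arrived, because its left
# timestamp (20) is still beyond the left watermark (10).
```

This is the behavior hinted at above: caring about the watermark of only one side changes what gets emitted and when, even though the buffered state covers both inputs.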

On Fri, Apr 21, 2017 at 2:26 PM Kenneth Knowles <kl...@google.com.invalid>
wrote:

> There's something to be said about having different triggering depending on
> which side of a join data comes from, perhaps?
>
> (delightful doc, as usual)
>
> Kenn
>
> On Fri, Apr 21, 2017 at 1:33 PM, Tyler Akidau <ta...@google.com.invalid>
> wrote:
>
> > Thanks for reading, Luke. The simple answer is that CoGBK is basically
> > flatten + GBK. Flatten is a non-grouping operation that merges the input
> > streams into a single output stream. GBK then groups the data within that
> > single union stream as you might otherwise expect, yielding a single
> table.
> > So I think it doesn't really impact things much. Grouping, aggregation,
> > window merging etc all just act upon the merged stream and generate what
> is
> > effectively a merged table.
> >
> > -Tyler
> >
> > On Fri, Apr 21, 2017 at 12:36 PM Lukasz Cwik <lc...@google.com.invalid>
> > wrote:
> >
> > > The doc is a good read.
> > >
> > > I think you do a great job of explaining table -> stream, stream ->
> > stream,
> > > and stream -> table when there is only one stream.
> > > But when there are multiple streams reading/writing to a table, how
> does
> > > that impact what occurs?
> > > For example, with CoGBK you have multiple streams writing to a table,
> how
> > > does that impact window merging?
> > >
> > > On Thu, Apr 20, 2017 at 5:57 PM, Tyler Akidau
> <takidau@google.com.invalid
> > >
> > > wrote:
> > >
> > > > Hello Beam, Calcite, and Flink dev lists!
> > > >
> > > > Apologies for the big cross post, but I thought this might be
> something
> > > all
> > > > three communities would find relevant.
> > > >
> > > > Beam is finally making progress on a SQL DSL utilizing Calcite,
> thanks
> > to
> > > > Mingmin Xu. As you can imagine, we need to come to some conclusion
> > about
> > > > how to elegantly support the full suite of streaming functionality in
> > the
> > > > Beam model via Calcite SQL. You folks in the Flink community have
> > been
> > > > pushing on this (e.g., adding windowing constructs, amongst others,
> > thank
> > > > you! :-), but from my understanding we still don't have a full spec
> for
> > > how
> > > > to support robust streaming in SQL (including but not limited to,
> > e.g., a
> > > > triggers analogue such as EMIT).
> > > >
> > > > I've been spending a lot of time thinking about this and have some
> > > opinions
> > > > about how I think it should look that I've already written down, so I
> > > > volunteered to try to drive forward agreement on a general streaming
> > SQL
> > > > spec between our three communities (well, technically I volunteered
> to
> > do
> > > > that w/ Beam and Calcite, but I figured you Flink folks might want to
> > > join
> > > > in since you're going that direction already anyway and will have
> > useful
> > > > insights :-).
> > > >
> > > > My plan was to do this by sharing two docs:
> > > >
> > > >    1. The Beam Model : Streams & Tables - This one is for context,
> and
> > > >    really only mentions SQL in passing. But it describes the
> > relationship
> > > >    between the Beam Model and the "streams & tables" way of thinking,
> > > which
> > > >    turns out to be useful in understanding what robust streaming in
> SQL
> > > > might
> > > >    look like. Many of you probably already know some or all of what's
> > in
> > > > here,
> > > >    but I felt it was necessary to have it all written down in order
> to
> > > > justify
> > > >    some of the proposals I wanted to make in the second doc.
> > > >
> > > >    2. A streaming SQL spec for Calcite - The goal for this doc is
> that
> > it
> > > >    would become a general specification for what robust streaming SQL
> > in
> > > >    Calcite should look like. It would start out as a basic proposal
> of
> > > what
> > > >    things *could* look like (combining both what things look like now
> > as
> > > > well
> > > >    as a set of proposed changes for the future), and we could all
> > iterate
> > > > on
> > > >    it together until we get to something we're happy with.
> > > >
> > > > At this point, I have doc #1 ready, and it's a bit of a monster, so I
> > > > figured I'd share it and let folks hack at it with comments if they
> > have
> > > > any, while I try to get the second doc ready in the meantime. As part
> > of
> > > > getting doc #2 ready, I'll be starting a separate thread to try to
> > gather
> > > > input on what things are already in flight for streaming SQL across
> the
> > > > various communities, to make sure the proposal captures everything
> > that's
> > > > going on as accurately as it can.
> > > >
> > > > If you have any questions or comments, I'm interested to hear them.
> > > > Otherwise, here's doc #1, "The Beam Model : Streams & Tables":
> > > >
> > > >   http://s.apache.org/beam-streams-tables
> > > >
> > > > -Tyler
> > > >
> > >
> >
>

Re: Towards a spec for robust streaming SQL, Part 1

Posted by Kenneth Knowles <kl...@google.com.INVALID>.
There's something to be said about having different triggering depending on
which side of a join data comes from, perhaps?

(delightful doc, as usual)

Kenn

On Fri, Apr 21, 2017 at 1:33 PM, Tyler Akidau <ta...@google.com.invalid>
wrote:

> Thanks for reading, Luke. The simple answer is that CoGBK is basically
> flatten + GBK. Flatten is a non-grouping operation that merges the input
> streams into a single output stream. GBK then groups the data within that
> single union stream as you might otherwise expect, yielding a single table.
> So I think it doesn't really impact things much. Grouping, aggregation,
> window merging etc all just act upon the merged stream and generate what is
> effectively a merged table.
>
> -Tyler
>
> On Fri, Apr 21, 2017 at 12:36 PM Lukasz Cwik <lc...@google.com.invalid>
> wrote:
>
> > The doc is a good read.
> >
> > I think you do a great job of explaining table -> stream, stream ->
> stream,
> > and stream -> table when there is only one stream.
> > But when there are multiple streams reading/writing to a table, how does
> > that impact what occurs?
> > For example, with CoGBK you have multiple streams writing to a table, how
> > does that impact window merging?
> >
> > On Thu, Apr 20, 2017 at 5:57 PM, Tyler Akidau <takidau@google.com.invalid
> >
> > wrote:
> >
> > > Hello Beam, Calcite, and Flink dev lists!
> > >
> > > Apologies for the big cross post, but I thought this might be something
> > all
> > > three communities would find relevant.
> > >
> > > Beam is finally making progress on a SQL DSL utilizing Calcite, thanks
> to
> > > Mingmin Xu. As you can imagine, we need to come to some conclusion
> about
> > > how to elegantly support the full suite of streaming functionality in
> the
> > > Beam model via Calcite SQL. You folks in the Flink community have
> been
> > > pushing on this (e.g., adding windowing constructs, amongst others,
> thank
> > > you! :-), but from my understanding we still don't have a full spec for
> > how
> > > to support robust streaming in SQL (including but not limited to,
> e.g., a
> > > triggers analogue such as EMIT).
> > >
> > > I've been spending a lot of time thinking about this and have some
> > opinions
> > > about how I think it should look that I've already written down, so I
> > > volunteered to try to drive forward agreement on a general streaming
> SQL
> > > spec between our three communities (well, technically I volunteered to
> do
> > > that w/ Beam and Calcite, but I figured you Flink folks might want to
> > join
> > > in since you're going that direction already anyway and will have
> useful
> > > insights :-).
> > >
> > > My plan was to do this by sharing two docs:
> > >
> > >    1. The Beam Model : Streams & Tables - This one is for context, and
> > >    really only mentions SQL in passing. But it describes the
> relationship
> > >    between the Beam Model and the "streams & tables" way of thinking,
> > which
> > >    turns out to be useful in understanding what robust streaming in SQL
> > > might
> > >    look like. Many of you probably already know some or all of what's
> in
> > > here,
> > >    but I felt it was necessary to have it all written down in order to
> > > justify
> > >    some of the proposals I wanted to make in the second doc.
> > >
> > >    2. A streaming SQL spec for Calcite - The goal for this doc is that
> it
> > >    would become a general specification for what robust streaming SQL
> in
> > >    Calcite should look like. It would start out as a basic proposal of
> > what
> > >    things *could* look like (combining both what things look like now
> as
> > > well
> > >    as a set of proposed changes for the future), and we could all
> iterate
> > > on
> > >    it together until we get to something we're happy with.
> > >
> > > At this point, I have doc #1 ready, and it's a bit of a monster, so I
> > > figured I'd share it and let folks hack at it with comments if they
> have
> > > any, while I try to get the second doc ready in the meantime. As part
> of
> > > getting doc #2 ready, I'll be starting a separate thread to try to
> gather
> > > input on what things are already in flight for streaming SQL across the
> > > various communities, to make sure the proposal captures everything
> that's
> > > going on as accurately as it can.
> > >
> > > If you have any questions or comments, I'm interested to hear them.
> > > Otherwise, here's doc #1, "The Beam Model : Streams & Tables":
> > >
> > >   http://s.apache.org/beam-streams-tables
> > >
> > > -Tyler
> > >
> >
>

Re: Towards a spec for robust streaming SQL, Part 1

Posted by Tyler Akidau <ta...@google.com.INVALID>.
Thanks for reading, Luke. The simple answer is that CoGBK is basically
flatten + GBK. Flatten is a non-grouping operation that merges the input
streams into a single output stream. GBK then groups the data within that
single union stream as you might otherwise expect, yielding a single table.
So I think it doesn't really impact things much. Grouping, aggregation,
window merging etc all just act upon the merged stream and generate what is
effectively a merged table.
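To illustrate (a toy Python model, not actual Beam code; the tagging scheme here is just one possible encoding), CoGBK as flatten + GBK looks roughly like this:

```python
from collections import defaultdict

def flatten(*streams):
    # Non-grouping merge of the input streams into a single stream,
    # tagging each element with its source index so the CoGBK output
    # can later separate the sides (toy model, not actual Beam code).
    for tag, stream in enumerate(streams):
        for key, value in stream:
            yield key, (tag, value)

def group_by_key(stream):
    # GBK over the single union stream, yielding one table.
    table = defaultdict(list)
    for key, tagged in stream:
        table[key].append(tagged)
    return dict(table)

def co_group_by_key(*streams):
    # CoGBK == flatten + GBK: merge, group the union stream, then
    # split each key's values back out per original input.
    grouped = group_by_key(flatten(*streams))
    return {
        key: tuple([v for t, v in tagged if t == i]
                   for i in range(len(streams)))
        for key, tagged in grouped.items()
    }
```

For example, co-grouping [("a", 1), ("b", 2)] with [("a", 3)] yields a single table mapping "a" to ([1], [3]) and "b" to ([2], []), i.e. one merged table regardless of how many input streams fed it.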

-Tyler

On Fri, Apr 21, 2017 at 12:36 PM Lukasz Cwik <lc...@google.com.invalid>
wrote:

> The doc is a good read.
>
> I think you do a great job of explaining table -> stream, stream -> stream,
> and stream -> table when there is only one stream.
> But when there are multiple streams reading/writing to a table, how does
> that impact what occurs?
> For example, with CoGBK you have multiple streams writing to a table, how
> does that impact window merging?
>
> On Thu, Apr 20, 2017 at 5:57 PM, Tyler Akidau <ta...@google.com.invalid>
> wrote:
>
> > Hello Beam, Calcite, and Flink dev lists!
> >
> > Apologies for the big cross post, but I thought this might be something
> all
> > three communities would find relevant.
> >
> > Beam is finally making progress on a SQL DSL utilizing Calcite, thanks to
> > Mingmin Xu. As you can imagine, we need to come to some conclusion about
> > how to elegantly support the full suite of streaming functionality in the
> > Beam model via Calcite SQL. You folks in the Flink community have been
> > pushing on this (e.g., adding windowing constructs, amongst others, thank
> > you! :-), but from my understanding we still don't have a full spec for
> how
> > to support robust streaming in SQL (including but not limited to, e.g., a
> > triggers analogue such as EMIT).
> >
> > I've been spending a lot of time thinking about this and have some
> opinions
> > about how I think it should look that I've already written down, so I
> > volunteered to try to drive forward agreement on a general streaming SQL
> > spec between our three communities (well, technically I volunteered to do
> > that w/ Beam and Calcite, but I figured you Flink folks might want to
> join
> > in since you're going that direction already anyway and will have useful
> > insights :-).
> >
> > My plan was to do this by sharing two docs:
> >
> >    1. The Beam Model : Streams & Tables - This one is for context, and
> >    really only mentions SQL in passing. But it describes the relationship
> >    between the Beam Model and the "streams & tables" way of thinking,
> which
> >    turns out to be useful in understanding what robust streaming in SQL
> > might
> >    look like. Many of you probably already know some or all of what's in
> > here,
> >    but I felt it was necessary to have it all written down in order to
> > justify
> >    some of the proposals I wanted to make in the second doc.
> >
> >    2. A streaming SQL spec for Calcite - The goal for this doc is that it
> >    would become a general specification for what robust streaming SQL in
> >    Calcite should look like. It would start out as a basic proposal of
> what
> >    things *could* look like (combining both what things look like now as
> > well
> >    as a set of proposed changes for the future), and we could all iterate
> > on
> >    it together until we get to something we're happy with.
> >
> > At this point, I have doc #1 ready, and it's a bit of a monster, so I
> > figured I'd share it and let folks hack at it with comments if they have
> > any, while I try to get the second doc ready in the meantime. As part of
> > getting doc #2 ready, I'll be starting a separate thread to try to gather
> > input on what things are already in flight for streaming SQL across the
> > various communities, to make sure the proposal captures everything that's
> > going on as accurately as it can.
> >
> > If you have any questions or comments, I'm interested to hear them.
> > Otherwise, here's doc #1, "The Beam Model : Streams & Tables":
> >
> >   http://s.apache.org/beam-streams-tables
> >
> > -Tyler
> >
>

Re: Towards a spec for robust streaming SQL, Part 1

Posted by Tyler Akidau <ta...@google.com.INVALID>.
Thanks for reading, Luke. The simple answer is that CoGBK is basically
flatten + GBK. Flatten is a non-grouping operation that merges the input
streams into a single output stream. GBK then groups the data within that
single union stream as you might otherwise expect, yielding a single table.
So I think it doesn't really impact things much. Grouping, aggregation,
window merging etc all just act upon the merged stream and generate what is
effectively a merged table.
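
The "flatten + GBK" decomposition above can be sketched in plain Python. This is only an illustrative model of the semantics, not actual Beam code; the function name, the integer stream tags, and the list-of-(key, value)-pairs representation are all invented for the example:

```python
from collections import defaultdict

def co_group_by_key(streams):
    """Model CoGBK as flatten + GBK: tag each input stream, flatten the
    tagged elements into one stream, then group by key into one table."""
    # Flatten: a non-grouping merge of all input streams into a single
    # stream of (key, (tag, value)) records, tagging each value with the
    # index of the stream it came from.
    flattened = [
        (key, (tag, value))
        for tag, stream in enumerate(streams)
        for key, value in stream
    ]
    # GBK: group the single merged stream by key, yielding one table that
    # maps each key to per-stream lists of values.
    table = defaultdict(lambda: [[] for _ in streams])
    for key, (tag, value) in flattened:
        table[key][tag].append(value)
    return dict(table)

left = [("a", 1), ("b", 2)]
right = [("a", 10)]
print(co_group_by_key([left, right]))
# {'a': [[1], [10]], 'b': [[2], []]}
```

The point of the sketch is that the grouping step only ever sees one merged stream, which is why multiple inputs don't complicate window merging: whatever grouping logic applies to a single stream applies unchanged to the flattened union.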

-Tyler


Re: Towards a spec for robust streaming SQL, Part 1

Posted by Lukasz Cwik <lc...@google.com.INVALID>.
The doc is a good read.

I think you do a great job of explaining table -> stream, stream -> stream,
and stream -> table when there is only one stream.
But when there are multiple streams reading/writing to a table, how does
that impact what occurs?
For example, with CoGBK you have multiple streams writing to a table, how
does that impact window merging?

On Thu, Apr 20, 2017 at 5:57 PM, Tyler Akidau <ta...@google.com.invalid>
wrote:

> Hello Beam, Calcite, and Flink dev lists!
>
> Apologies for the big cross post, but I thought this might be something all
> three communities would find relevant.
>
> Beam is finally making progress on a SQL DSL utilizing Calcite, thanks to
> Mingmin Xu. As you can imagine, we need to come to some conclusion about
> how to elegantly support the full suite of streaming functionality in the
> Beam model via Calcite SQL. You folks in the Flink community have been
> pushing on this (e.g., adding windowing constructs, amongst others, thank
> you! :-), but from my understanding we still don't have a full spec for how
> to support robust streaming in SQL (including but not limited to, e.g., a
> triggers analogue such as EMIT).
>
> I've been spending a lot of time thinking about this and have some opinions
> about how I think it should look that I've already written down, so I
> volunteered to try to drive forward agreement on a general streaming SQL
> spec between our three communities (well, technically I volunteered to do
> that w/ Beam and Calcite, but I figured you Flink folks might want to join
> in since you're going that direction already anyway and will have useful
> insights :-).
>
> My plan was to do this by sharing two docs:
>
>    1. The Beam Model : Streams & Tables - This one is for context, and
>    really only mentions SQL in passing. But it describes the relationship
>    between the Beam Model and the "streams & tables" way of thinking, which
>    turns out to be useful in understanding what robust streaming in SQL
> might
>    look like. Many of you probably already know some or all of what's in
> here,
>    but I felt it was necessary to have it all written down in order to
> justify
>    some of the proposals I wanted to make in the second doc.
>
>    2. A streaming SQL spec for Calcite - The goal for this doc is that it
>    would become a general specification for what robust streaming SQL in
>    Calcite should look like. It would start out as a basic proposal of what
>    things *could* look like (combining both what things look like now as
> well
>    as a set of proposed changes for the future), and we could all iterate
> on
>    it together until we get to something we're happy with.
>
> At this point, I have doc #1 ready, and it's a bit of a monster, so I
> figured I'd share it and let folks hack at it with comments if they have
> any, while I try to get the second doc ready in the meantime. As part of
> getting doc #2 ready, I'll be starting a separate thread to try to gather
> input on what things are already in flight for streaming SQL across the
> various communities, to make sure the proposal captures everything that's
> going on as accurately as it can.
>
> If you have any questions or comments, I'm interested to hear them.
> Otherwise, here's doc #1, "The Beam Model : Streams & Tables":
>
>   http://s.apache.org/beam-streams-tables
>
> -Tyler
>
