Posted to dev@beam.apache.org by Kenneth Knowles <ke...@apache.org> on 2019/07/18 17:51:09 UTC

[PROPOSAL] Revised streaming extensions for Beam SQL

Hi all,

I recently had the great privilege to work with others from Beam plus
Calcite and Flink SQL contributors to build a new and minimal proposal for
adding streaming extensions to standard SQL: event time, watermarks,
windowing, triggers, stream materialization.

We hope this will influence the standards body, as well as Calcite, Flink,
and other projects working on streaming SQL.

I would like to start implementing these extensions in Beam, moving from
our current streaming extensions to the new proposal.

   The whole paper is https://arxiv.org/abs/1905.12133

   My small proposal to start in Beam:
https://s.apache.org/streaming-beam-sql

TL;DR: replace `GROUP BY Tumble/Hop/Session` with table functions that do
Tumble, Hop, Session. The details of why to make this change are explained
in the appendix to my proposal. For the big picture of how it fits in, the
full paper is best.
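
To illustrate the TL;DR, here is a minimal Python sketch of the idea, not
Beam SQL syntax: a Tumble "table function" takes a table and returns the
same rows with window columns added, and the aggregation is then an
ordinary GROUP BY on those columns. The function names, the integer
timestamps, and the `wstart`/`wend` column names are assumptions made for
this example only.

```python
from collections import defaultdict

def tumble(rows, timecol, size):
    """Table function: return each row extended with its fixed-size window."""
    out = []
    for row in rows:
        # Align the timestamp down to the nearest window boundary.
        wstart = row[timecol] - row[timecol] % size
        out.append({**row, "wstart": wstart, "wend": wstart + size})
    return out

def sum_by_window(windowed_rows, valuecol):
    """Plain GROUP BY (wstart, wend) over the table function's output."""
    totals = defaultdict(int)
    for row in windowed_rows:
        totals[(row["wstart"], row["wend"])] += row[valuecol]
    return dict(totals)

# Hypothetical input table; timestamps are plain integers for simplicity.
bids = [
    {"bidtime": 1, "price": 2},
    {"bidtime": 7, "price": 3},
    {"bidtime": 12, "price": 5},
]
windowed = tumble(bids, "bidtime", 10)
totals = sum_by_window(windowed, "price")
```

The windowing step and the grouping step are separate here, which is the
point of moving windowing out of the GROUP BY clause.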

Kenn

Re: [PROPOSAL] Revised streaming extensions for Beam SQL

Posted by jincheng sun <su...@gmail.com>.
Thanks for bringing up this discussion, Kenn!

Definitely +1 for the proposal.

I have left some questions in the document :)

Best,
Jincheng

On Wed, Dec 11, 2019 at 5:23 AM Rui Wang <ru...@google.com> wrote:

> Since I have not seen more comments on this proposal so far, can we
> consider it accepted by the Beam community?
>
> If it is accepted, I would like to start a discussion on deprecating the
> old GROUP BY windowing style and keeping only table-valued function
> windowing.
>
>
> -Rui

Re: [PROPOSAL] Revised streaming extensions for Beam SQL

Posted by Rui Wang <ru...@google.com>.
Since I have not seen more comments on this proposal so far, can we
consider it accepted by the Beam community?

If it is accepted, I would like to start a discussion on deprecating the
old GROUP BY windowing style and keeping only table-valued function
windowing.


-Rui

On Thu, Jul 25, 2019 at 11:32 AM Kenneth Knowles <ke...@apache.org> wrote:

> We hope it does enter the SQL standard. It is one reason for coming
> together to write this paper.
>
> The OVER clause is mentioned often.
>
>  - TUMBLE can actually just be a function, so you don't need OVER or any of
> the fancy stuff we propose; it is only done that way to make them all look
> similar
>  - HOP still doesn't work, since the OVER clause has one value per input
> row: it is still a 1-to-1 input/output ratio, while HOP can assign a row to
> several windows
>  - SESSION GAP 5 MINUTES (PARTITION BY key) is actually a natural syntax
> that could work well
>
> None of them require ORDER, by design.
>
> On the other hand, implementing the general OVER clause and the rank,
> running sum, etc, could be done with GBK + sort values. That is not related
> to windowing. And since in SQL users of windowing will think of OVER as
> related to ordering, I personally don't want to also use it for something
> that has nothing to do with ordering.
>
> But if you would write something up, that could be interesting to discuss
> further.
>
> Kenn

Re: [PROPOSAL] Revised streaming extensions for Beam SQL

Posted by Kenneth Knowles <ke...@apache.org>.
We hope it does enter the SQL standard. It is one reason for coming
together to write this paper.

The OVER clause is mentioned often.

 - TUMBLE can actually just be a function, so you don't need OVER or any of
the fancy stuff we propose; it is only done that way to make them all look
similar
 - HOP still doesn't work, since the OVER clause has one value per input
row: it is still a 1-to-1 input/output ratio, while HOP can assign a row to
several windows
 - SESSION GAP 5 MINUTES (PARTITION BY key) is actually a natural syntax
that could work well

None of them require ORDER, by design.
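
The input/output ratio argument above can be sketched in Python. This is an
illustration under assumptions, not the proposal's implementation: the
function names, the integer timestamps, and the `wstart`/`wend` columns are
invented for the example. `hop` emits one output row per window that
contains the input row, so its output can be larger than its input, while
`session` keeps one output row per input row and only needs a key and a
gap. The sort inside `session` is an internal detail of this batch sketch,
not an ORDER BY the user has to write.

```python
def hop(rows, timecol, size, slide):
    """Table function: emit one row per sliding window containing the input row."""
    out = []
    for row in rows:
        t = row[timecol]
        wstart = t - t % slide  # latest window start at or before t
        while wstart > t - size:  # every window [wstart, wstart + size) covering t
            out.append({**row, "wstart": wstart, "wend": wstart + size})
            wstart -= slide
    return out

def _close_session(rows, timecol, gap):
    """Stamp all rows of one finished session with the session's bounds."""
    wstart = rows[0][timecol]
    wend = rows[-1][timecol] + gap
    return [{**r, "wstart": wstart, "wend": wend} for r in rows]

def session(rows, timecol, keycol, gap):
    """Table function: per key, rows less than `gap` apart share one window."""
    by_key = {}
    for row in rows:
        by_key.setdefault(row[keycol], []).append(row)
    out = []
    for key_rows in by_key.values():
        key_rows.sort(key=lambda r: r[timecol])  # internal detail only
        current = []
        for row in key_rows:
            if current and row[timecol] - current[-1][timecol] >= gap:
                out.extend(_close_session(current, timecol, gap))
                current = []
            current.append(row)
        out.extend(_close_session(current, timecol, gap))
    return out

# Hypothetical events; timestamps are plain integers for simplicity.
events = [
    {"key": "a", "ts": 1},
    {"key": "a", "ts": 4},
    {"key": "a", "ts": 20},
    {"key": "b", "ts": 3},
]
hopped = hop(events, "ts", size=10, slide=5)    # 2 output rows per input row
sessions = session(events, "ts", "key", gap=5)  # 1 output row per input row
```

With a window size of 10 and a slide of 5, every event lands in two
windows, which a one-value-per-row construct such as OVER cannot express.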

On the other hand, implementing the general OVER clause and the rank,
running sum, etc, could be done with GBK + sort values. That is not related
to windowing. And since in SQL users of windowing will think of OVER as
related to ordering, I personally don't want to also use it for something
that has nothing to do with ordering.
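
As a rough sketch of "GBK + sort values", assuming in-memory rows rather
than a real Beam pipeline: group the rows by the partition key, sort each
group by the ordering column, then compute the rank-style value and the
running sum in one pass. The function and column names below are
hypothetical and chosen only for this example.

```python
from collections import defaultdict
from itertools import accumulate

def over_partition(rows, partition_by, order_by, value):
    """Emulate ROW_NUMBER() and a running SUM() OVER (PARTITION BY .. ORDER BY ..)."""
    groups = defaultdict(list)
    for row in rows:  # "GBK": group rows by the partition key
        groups[row[partition_by]].append(row)
    out = []
    for group in groups.values():
        group.sort(key=lambda r: r[order_by])  # "sort values" within the group
        running = accumulate(r[value] for r in group)
        for n, (row, total) in enumerate(zip(group, running), start=1):
            out.append({**row, "row_number": n, "running_sum": total})
    return out

# Hypothetical input rows.
rows = [
    {"col1": "x", "col2": 2, "v": 10},
    {"col1": "x", "col2": 1, "v": 5},
    {"col1": "y", "col2": 1, "v": 7},
]
result = over_partition(rows, "col1", "col2", "v")
```

Each input row yields exactly one output row, and the ordering column is
essential to the result, which is what separates this from windowing.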

But if you would write something up, that could be interesting to discuss
further.

Kenn

On Wed, Jul 24, 2019 at 2:24 PM Mingmin Xu <mi...@gmail.com> wrote:

> +1 to removing those magic words in Calcite streaming SQL, simply because
> they're not standard SQL. Replacing HOP/TUMBLE with table-valued functions
> makes it concise. My only question: is it (or will it be) part of the SQL
> standard? I'm a big fan of aligning with standards :lol
>
> P.S. Although the concept of `window` used here is different from window
> functions in SQL, the syntax gives some insight. Take the example of
> `ROW_NUMBER() OVER (PARTITION BY COL1 ORDER BY COL2) AS row_number`:
> `ROW_NUMBER()` assigns a sequence value to the records in each subgroup
> keyed by 'COL1'. We could introduce another function, like TUMBLE(), which
> assigns a window instance (or several instances for HOP()) to each record.
>
> Mingmin

Re: [PROPOSAL] Revised streaming extensions for Beam SQL

Posted by Mingmin Xu <mi...@gmail.com>.
+1 to removing those magic words in Calcite streaming SQL, simply because
they're not standard SQL. Replacing HOP/TUMBLE with table-valued functions
makes it concise. My only question: is it (or will it be) part of the SQL
standard? I'm a big fan of aligning with standards :lol

P.S. Although the concept of `window` used here is different from window
functions in SQL, the syntax gives some insight. Take the example of
`ROW_NUMBER() OVER (PARTITION BY COL1 ORDER BY COL2) AS row_number`:
`ROW_NUMBER()` assigns a sequence value to the records in each subgroup
keyed by 'COL1'. We could introduce another function, like TUMBLE(), which
assigns a window instance (or several instances for HOP()) to each record.

Mingmin


On Sun, Jul 21, 2019 at 9:42 PM Manu Zhang <ow...@gmail.com> wrote:

> Thanks Kenn,
> great paper. I left some newbie questions on the proposal.
>
> Manu

-- 
----
Mingmin

Re: [PROPOSAL] Revised streaming extensions for Beam SQL

Posted by Manu Zhang <ow...@gmail.com>.
Thanks Kenn,
great paper. I left some newbie questions on the proposal.

Manu

On Fri, Jul 19, 2019 at 1:51 AM Kenneth Knowles <ke...@apache.org> wrote:

> Hi all,
>
> I recently had the great privilege to work with others from Beam plus
> Calcite and Flink SQL contributors to build a new and minimal proposal for
> adding streaming extensions to standard SQL: event time, watermarks,
> windowing, triggers, stream materialization.
>
> We hope this will influence the standards body, as well as Calcite, Flink,
> and other projects working on streaming SQL.
>
> I would like to start implementing these extensions in Beam, moving from
> our current streaming extensions to the new proposal.
>
>    The whole paper is https://arxiv.org/abs/1905.12133
>
>    My small proposal to start in Beam:
> https://s.apache.org/streaming-beam-sql
>
> TL;DR: replace `GROUP BY Tumble/Hop/Session` with table functions that do
> Tumble, Hop, Session. The details of why to make this change are explained
> in the appendix to my proposal. For the big picture of how it fits in, the
> full paper is best.
>
> Kenn
>