You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@calcite.apache.org by Danny Chan <yu...@gmail.com> on 2020/09/23 04:05:01 UTC

[ DISCUSS] The window table functions join syntax

Hi, fellows, long time no see on the mailing ~

Here I want to have a discussion on the join syntax of our recently introduced window table function ~

For example, we can define a tumbling window function of 5 minutes size as:

Tumble(table T, descriptor(T.ts), INTERVAL ‘5’ MINUTE)

The we can select from it, and moreover, I want to support 2 window function join for the streaming query recently.

The semantics of the windowed stream join is:

• The 2 window inputs should have the same window arguments (except for the table name), e.g. for TUMBLE the size should be equal, for HOP, both the side interval and size should be equal
• We first window the input stream then join the both window data set of the same TimeWindow
• The Join action is triggered by the watermark of the stream
• The join does not produce retractions of the stream, the mainly difference with normal two-stream join

And I want to propose a join syntax as:

Select L.f0, R.f2, L.window_start, L.window_end
FROM
Tumble(table T1, descriptor(T1.ts), INTERVAL ‘5’ MINUTE) L
JOIN
Tumble(table T2, descriptor(T2.ts), INTERVAL ‘5’ MINUTE) R
ON
L.f0 = R.f0 AND L.window_start = R.window_start AND L.window_end = R.window_end

The red syntax part is what I want to discuss, the condition seems too verbose because
user need to declare it every time.


• Should we make it optional ?
• Is there better syntax to describe this window join semantics ?


Best,
Danny Chan

Re: [ DISCUSS] The window table functions join syntax

Posted by Rui Wang <am...@apache.org>.
Thanks Villiam for pointing this out! So the standard still solves
duplicate names in "TUMBLE(TUMBLE())" by alias, which makes sense. Then we
can evaluate whether Calcite supports alias for table function in a correct
way in CALCITE-4274 <https://issues.apache.org/jira/browse/CALCITE-4274>.


-Rui

On Thu, Sep 24, 2020 at 3:12 AM Viliam Durina <vi...@hazelcast.com> wrote:

> I think we're discussing two issues. One is "TUMBLE join TUMBLE" and the
> other is "TUMBLE(TUMBLE())". In both cases the `window_start` and
> `window_end` columns are duplicated.
>
> In case of JOIN, I don't see any issue. It's equivalent to any join of two
> tables, e.g.:
>
>   SELECT *
>   FROM t1
>   JOIN t2 on t1.id=t2.id
>
> You'll have two ID columns in the result, no issue here.
>
> In case of "TUMBLE(TUMBLE())", I'd point to the feature of polymorphic
> table functions in the standard:
>
>   CREATE TABLE input_table(window_start, window_end, timestamp);
>
>   SELECT T.window_start, T.window_end, T.timestamp, F.window_start,
> F.window_end
>   FROM TABLE(TUMBLE(TABLE(input_table) AS T, DESCRIPTOR(timestamp),
> INTERVAL '5' MINUTES)) AS F
>
> As you can see, you can refer to the duplicate columns using a different
> alias: `T` alias is for the column in the input_table, `F` alias is for the
> columns added by the function.
>
> Viliam
>
> On Thu, 24 Sep 2020 at 00:21, Kenneth Knowles <ke...@apache.org> wrote:
>
> > It happens in "raw Beam" somewhat often that there is one windowed
> > aggregation followed by another. It is pretty unheard of to "assign"
> > windows twice before an aggregation, because Beam cannot do anything
> useful
> > with this: Beam does not have the ability to have two different
> > fields/columns that are both associated with a watermark. So TUMBLE
> > followed by TUMBLE is more expressive. You get a number of columns all of
> > which can be used for windowed aggregation.
> >
> > One reason this could happen is that you are authoring a table function
> > that operates generically. It does not know that its input has already
> been
> > TUMBLED. In Beam you would reify the prior window information, do your
> own
> > TUMBLE. Likely you would do an aggregation and then restore the prior
> > window information somehow. A compelling example does not come to mind,
> but
> > I think considering applying TUMBLE to a generic input helps clarify that
> > it could happen in code worth allowing.
> >
> > Kenn
> >
> > On Wed, Sep 23, 2020 at 11:47 AM Julian Hyde <jh...@apache.org> wrote:
> >
> > > I don't think it's a problem with table functions in general. And
> > > besides, we can't change the semantics of table functions. A table
> > > function must not produce duplicate column names.
> > >
> > > The problem is with the semantics of these particular table functions
> > > - HOP, TUMBLE, SESSION - and what semantics are desirable depends on
> > > how people will typically use them. Is it common to follow TUMBLE with
> > > TUMBLE? What would a user expect to be the output columns?
> > >
> > > On Wed, Sep 23, 2020 at 11:21 AM Rui Wang <am...@apache.org>
> wrote:
> > > >
> > > > >Is it reasonable to apply TUMBLE to TUMBLE? If so, would people
> > > > > generally want two sets of window_start, window_end columns?
> > > >
> > > > I think it is reasonable to apply TUMBLE to TUMBLE or even TUMBLE to
> > HOP
> > > > join, as long as there is a real requirement there. The window
> > > starts/ends
> > > > are not duplicates. For example TUMBLE as L JOIN HOP as R, L offers a
> > > > window start and a window end, same for R. This is no different from
> a
> > > > normal JOIN case where both JOIN sides have the same column names
> (but
> > > they
> > > > are not considered duplicates). The SQL rule is still applying:
> within
> > a
> > > > scope there shouldn't be ambiguous column names (e.g. duplicated
> column
> > > > name). For JOIN duplicate names from JOIN inputs are differentiated
> by
> > > > table alias.
> > > >
> > > >
> > > > Regarding https://issues.apache.org/jira/browse/CALCITE-4274, this
> is
> > an
> > > > interesting case that is different from the JOIN case, and I also
> think
> > > > this is a general case (not limited to TUMBLE).
> > > >
> > > > Think about that for any query that uses table function of the
> pattern
> > > > in CALCITE-4274. The first table function generates column A and then
> > it
> > > > becomes the input for the second table function, which also wants to
> > > append
> > > > a column named "A". How should Calcite handle this case?
> > > >
> > > >
> > > > -Rui
> > > >
> > > >
> > > >
> > > > On Wed, Sep 23, 2020 at 9:13 AM Julian Hyde <jh...@apache.org>
> wrote:
> > > >
> > > > > I think we should also discuss
> > > > > https://issues.apache.org/jira/browse/CALCITE-4274 here.
> > > > >
> > > > > We've never discussed what should happen if you apply TUMBLE to
> > TUMBLE
> > > > > (or TUMBLE to HOP, etc.). What happens now is that you get
> duplicate
> > > > > columns.
> > > > >
> > > > > Is it reasonable to apply TUMBLE to TUMBLE? If so, would people
> > > > > generally want two sets of window_start, window_end columns?
> > > > >
> > > > > Julian
> > > > >
> > > > > On Wed, Sep 23, 2020 at 2:41 AM Danny Chan <yu...@gmail.com>
> > > wrote:
> > > > > >
> > > > > > Thanks for the feedback, I agree we should keep the verbose part
> > > > > >
> > > > > > **L.window_start = R.window_start AND L.window_end
> =R.window_end**
> > > > > >
> > > > > > Which would make the semantic more clear ~
> > > > > >
> > > > > > Best,
> > > > > > Danny Chan
> > > > > > 在 2020年9月23日 +0800 PM3:24,Viliam Durina <viliam@hazelcast.com
> >,写道:
> > > > > > > You can also use
> > > > > > >
> > > > > > > SELECT L.f0, R.f2, L.window_start, L.window_end
> > > > > > > FROM
> > > > > > > Tumble(table T1, descriptor(T1.ts), INTERVAL ‘5’ MINUTE) L
> > > > > > > JOIN
> > > > > > > Tumble(table T2, descriptor(T2.ts), INTERVAL ‘5’ MINUTE) R
> > > > > > > USING (f0, window_start)
> > > > > > >
> > > > > > > Viliam
> > > > > > >
> > > > > > > On Wed, 23 Sep 2020 at 08:02, Rui Wang <am...@apache.org>
> > > wrote:
> > > > > > >
> > > > > > > > Regarding to **L.window_start = R.window_start AND
> > L.window_end =
> > > > > > > > R.window_end**:
> > > > > > > >
> > > > > > > > In general, the current table function windowing model is to
> > > append
> > > > > window
> > > > > > > > metadata to table directly, thus window metadata becomes a
> part
> > > of
> > > > > table
> > > > > > > > (or call it data). So as a part of table, these two columns
> > > should be
> > > > > > > > treated as normal columns thus they should be in the join on
> > > > > condition.
> > > > > > > >
> > > > > > > > If you want to make it optional, it makes window start/end
> > > columns
> > > > > special
> > > > > > > > and has a semantic binding with special table functions
> > (TUMBLE,
> > > HOP,
> > > > > > > > SESSION), which then becomes really not a SQL thing. For
> > > example, we
> > > > > can
> > > > > > > > allow users to define their own windowing table function. In
> > that
> > > > > case, how
> > > > > > > > will you utilize window start/end produced by a customized
> > > windowing
> > > > > table
> > > > > > > > function? What if users produce wired windows that have
> > > overlapped
> > > > > window
> > > > > > > > starts or window ends?
> > > > > > > >
> > > > > > > > Keeping windows start/end as a part of the table, treating
> them
> > > no
> > > > > > > > different from other columns, could give a consistent
> behavior
> > > for
> > > > > either
> > > > > > > > built-in table function or user-defined table function.
> > > > > > > >
> > > > > > > > If you think it is too verbose, there are two options to
> > > optimize:
> > > > > > > >
> > > > > > > > 1. for TUMBLE/HOP/SESSION, to identify a unique window, you
> > will
> > > > > only need
> > > > > > > > either window start or end, so you can simplify it, for
> > example,
> > > to
> > > > > > > > L.window_start = R.window_start only.
> > > > > > > > 2. (not recommended), you can cut off **L.window_start =
> > > > > R.window_start AND
> > > > > > > > L.window_end = R.window_end**, but add window metadata
> > > comparison to
> > > > > join
> > > > > > > > implicitly by execution engine. E.g. you can make up the join
> > > > > condition in
> > > > > > > > your JoinRel if two inputs are TUMBLE.
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > -Rui
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > On Tue, Sep 22, 2020 at 10:27 PM Danny Chan <
> > > yuzhao.cyz@gmail.com>
> > > > > wrote:
> > > > > > > >
> > > > > > > > > Yes, the red part is **L.window_start = R.window_start AND
> > > > > L.window_end =
> > > > > > > > > R.window_end**
> > > > > > > > >
> > > > > > > > > > Is this a limitation for "triggered by the watermark of
> the
> > > > > stream”?
> > > > > > > > >
> > > > > > > > > No, because in most of the cases, there is no need to
> output
> > > the
> > > > > > > > > intermediate/partial join records then send retractions.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > So, how do you think about the condition syntax
> > > **L.window_start =
> > > > > > > > > R.window_start AND L.window_end = R.window_end** ?
> > > > > > > > >
> > > > > > > > > Best,
> > > > > > > > > Danny Chan
> > > > > > > > > 在 2020年9月23日 +0800 PM12:47,dev@calcite.apache.org,写道:
> > > > > > > > > >
> > > > > > > > > > L.window_start = R.window_start AND L.window_end =
> > > R.window_end
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > --
> > > > > > > Viliam Durina
> > > > > > > Jet Developer
> > > > > > > hazelcast®
> > > > > > >
> > > > > > > <https://www.hazelcast.com> 2 W 5th Ave, Ste 300 | San Mateo,
> CA
> > > > > 94402 |
> > > > > > > USA
> > > > > > > +1 (650) 521-5453 <(650)%20521-5453> <(650)%20521-5453>
> <(650)%20521-5453> |
> > > hazelcast.com <
> > > > > https://www.hazelcast.com>
> > > > > > >
> > > > > > > --
> > > > > > > This message contains confidential information and is intended
> > only
> > > > > for the
> > > > > > > individuals named. If you are not the named addressee you
> should
> > > not
> > > > > > > disseminate, distribute or copy this e-mail. Please notify the
> > > sender
> > > > > > > immediately by e-mail if you have received this e-mail by
> mistake
> > > and
> > > > > > > delete this e-mail from your system. E-mail transmission cannot
> > be
> > > > > > > guaranteed to be secure or error-free as information could be
> > > > > intercepted,
> > > > > > > corrupted, lost, destroyed, arrive late or incomplete, or
> contain
> > > > > viruses.
> > > > > > > The sender therefore does not accept liability for any errors
> or
> > > > > omissions
> > > > > > > in the contents of this message, which arise as a result of
> > e-mail
> > > > > > > transmission. If verification is required, please request a
> > > hard-copy
> > > > > > > version. -Hazelcast
> > > > >
> > >
> >
>
>
> --
> Viliam Durina
> Jet Developer
>       hazelcast®
>
>   <https://www.hazelcast.com> 2 W 5th Ave, Ste 300 | San Mateo, CA 94402 |
> USA
> +1 (650) 521-5453 <(650)%20521-5453> | hazelcast.com <
> https://www.hazelcast.com>
>
> --
> This message contains confidential information and is intended only for
> the
> individuals named. If you are not the named addressee you should not
> disseminate, distribute or copy this e-mail. Please notify the sender
> immediately by e-mail if you have received this e-mail by mistake and
> delete this e-mail from your system. E-mail transmission cannot be
> guaranteed to be secure or error-free as information could be intercepted,
> corrupted, lost, destroyed, arrive late or incomplete, or contain viruses.
> The sender therefore does not accept liability for any errors or omissions
> in the contents of this message, which arise as a result of e-mail
> transmission. If verification is required, please request a hard-copy
> version. -Hazelcast
>

Re: [ DISCUSS] The window table functions join syntax

Posted by Viliam Durina <vi...@hazelcast.com>.
I think we're discussing two issues. One is "TUMBLE join TUMBLE" and the
other is "TUMBLE(TUMBLE())". In both cases the `window_start` and
`window_end` columns are duplicated.

In case of JOIN, I don't see any issue. It's equivalent to any join of two
tables, e.g.:

  SELECT *
  FROM t1
  JOIN t2 on t1.id=t2.id

You'll have two ID columns in the result, no issue here.

In case of "TUMBLE(TUMBLE())", I'd point to the feature of polymorphic
table functions in the standard:

  CREATE TABLE input_table(window_start, window_end, timestamp);

  SELECT T.window_start, T.window_end, T.timestamp, F.window_start,
F.window_end
  FROM TABLE(TUMBLE(TABLE(input_table) AS T, DESCRIPTOR(timestamp),
INTERVAL '5' MINUTES)) AS F

As you can see, you can refer to the duplicate columns using a different
alias: `T` alias is for the column in the input_table, `F` alias is for the
columns added by the function.

Viliam

On Thu, 24 Sep 2020 at 00:21, Kenneth Knowles <ke...@apache.org> wrote:

> It happens in "raw Beam" somewhat often that there is one windowed
> aggregation followed by another. It is pretty unheard of to "assign"
> windows twice before an aggregation, because Beam cannot do anything useful
> with this: Beam does not have the ability to have two different
> fields/columns that are both associated with a watermark. So TUMBLE
> followed by TUMBLE is more expressive. You get a number of columns all of
> which can be used for windowed aggregation.
>
> One reason this could happen is that you are authoring a table function
> that operates generically. It does not know that its input has already been
> TUMBLED. In Beam you would reify the prior window information, do your own
> TUMBLE. Likely you would do an aggregation and then restore the prior
> window information somehow. A compelling example does not come to mind, but
> I think considering applying TUMBLE to a generic input helps clarify that
> it could happen in code worth allowing.
>
> Kenn
>
> On Wed, Sep 23, 2020 at 11:47 AM Julian Hyde <jh...@apache.org> wrote:
>
> > I don't think it's a problem with table functions in general. And
> > besides, we can't change the semantics of table functions. A table
> > function must not produce duplicate column names.
> >
> > The problem is with the semantics of these particular table functions
> > - HOP, TUMBLE, SESSION - and what semantics are desirable depends on
> > how people will typically use them. Is it common to follow TUMBLE with
> > TUMBLE? What would a user expect to be the output columns?
> >
> > On Wed, Sep 23, 2020 at 11:21 AM Rui Wang <am...@apache.org> wrote:
> > >
> > > >Is it reasonable to apply TUMBLE to TUMBLE? If so, would people
> > > > generally want two sets of window_start, window_end columns?
> > >
> > > I think it is reasonable to apply TUMBLE to TUMBLE or even TUMBLE to
> HOP
> > > join, as long as there is a real requirement there. The window
> > starts/ends
> > > are not duplicates. For example TUMBLE as L JOIN HOP as R, L offers a
> > > window start and a window end, same for R. This is no different from a
> > > normal JOIN case where both JOIN sides have the same column names (but
> > they
> > > are not considered duplicates). The SQL rule is still applying: within
> a
> > > scope there shouldn't be ambiguous column names (e.g. duplicated column
> > > name). For JOIN duplicate names from JOIN inputs are differentiated by
> > > table alias.
> > >
> > >
> > > Regarding https://issues.apache.org/jira/browse/CALCITE-4274, this is
> an
> > > interesting case that is different from the JOIN case, and I also think
> > > this is a general case (not limited to TUMBLE).
> > >
> > > Think about that for any query that uses table function of the pattern
> > > in CALCITE-4274. The first table function generates column A and then
> it
> > > becomes the input for the second table function, which also wants to
> > append
> > > a column named "A". How should Calcite handle this case?
> > >
> > >
> > > -Rui
> > >
> > >
> > >
> > > On Wed, Sep 23, 2020 at 9:13 AM Julian Hyde <jh...@apache.org> wrote:
> > >
> > > > I think we should also discuss
> > > > https://issues.apache.org/jira/browse/CALCITE-4274 here.
> > > >
> > > > We've never discussed what should happen if you apply TUMBLE to
> TUMBLE
> > > > (or TUMBLE to HOP, etc.). What happens now is that you get duplicate
> > > > columns.
> > > >
> > > > Is it reasonable to apply TUMBLE to TUMBLE? If so, would people
> > > > generally want two sets of window_start, window_end columns?
> > > >
> > > > Julian
> > > >
> > > > On Wed, Sep 23, 2020 at 2:41 AM Danny Chan <yu...@gmail.com>
> > wrote:
> > > > >
> > > > > Thanks for the feedback, I agree we should keep the verbose part
> > > > >
> > > > > **L.window_start = R.window_start AND L.window_end =R.window_end**
> > > > >
> > > > > Which would make the semantic more clear ~
> > > > >
> > > > > Best,
> > > > > Danny Chan
> > > > > 在 2020年9月23日 +0800 PM3:24,Viliam Durina <vi...@hazelcast.com>,写道:
> > > > > > You can also use
> > > > > >
> > > > > > SELECT L.f0, R.f2, L.window_start, L.window_end
> > > > > > FROM
> > > > > > Tumble(table T1, descriptor(T1.ts), INTERVAL ‘5’ MINUTE) L
> > > > > > JOIN
> > > > > > Tumble(table T2, descriptor(T2.ts), INTERVAL ‘5’ MINUTE) R
> > > > > > USING (f0, window_start)
> > > > > >
> > > > > > Viliam
> > > > > >
> > > > > > On Wed, 23 Sep 2020 at 08:02, Rui Wang <am...@apache.org>
> > wrote:
> > > > > >
> > > > > > > Regarding to **L.window_start = R.window_start AND
> L.window_end =
> > > > > > > R.window_end**:
> > > > > > >
> > > > > > > In general, the current table function windowing model is to
> > append
> > > > window
> > > > > > > metadata to table directly, thus window metadata becomes a part
> > of
> > > > table
> > > > > > > (or call it data). So as a part of table, these two columns
> > should be
> > > > > > > treated as normal columns thus they should be in the join on
> > > > condition.
> > > > > > >
> > > > > > > If you want to make it optional, it makes window start/end
> > columns
> > > > special
> > > > > > > and has a semantic binding with special table functions
> (TUMBLE,
> > HOP,
> > > > > > > SESSION), which then becomes really not a SQL thing. For
> > example, we
> > > > can
> > > > > > > allow users to define their own windowing table function. In
> that
> > > > case, how
> > > > > > > will you utilize window start/end produced by a customized
> > windowing
> > > > table
> > > > > > > function? What if users produce wired windows that have
> > overlapped
> > > > window
> > > > > > > starts or window ends?
> > > > > > >
> > > > > > > Keeping windows start/end as a part of the table, treating them
> > no
> > > > > > > different from other columns, could give a consistent behavior
> > for
> > > > either
> > > > > > > built-in table function or user-defined table function.
> > > > > > >
> > > > > > > If you think it is too verbose, there are two options to
> > optimize:
> > > > > > >
> > > > > > > 1. for TUMBLE/HOP/SESSION, to identify a unique window, you
> will
> > > > only need
> > > > > > > either window start or end, so you can simplify it, for
> example,
> > to
> > > > > > > L.window_start = R.window_start only.
> > > > > > > 2. (not recommended), you can cut off **L.window_start =
> > > > R.window_start AND
> > > > > > > L.window_end = R.window_end**, but add window metadata
> > comparison to
> > > > join
> > > > > > > implicitly by execution engine. E.g. you can make up the join
> > > > condition in
> > > > > > > your JoinRel if two inputs are TUMBLE.
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > -Rui
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > On Tue, Sep 22, 2020 at 10:27 PM Danny Chan <
> > yuzhao.cyz@gmail.com>
> > > > wrote:
> > > > > > >
> > > > > > > > Yes, the red part is **L.window_start = R.window_start AND
> > > > L.window_end =
> > > > > > > > R.window_end**
> > > > > > > >
> > > > > > > > > Is this a limitation for "triggered by the watermark of the
> > > > stream”?
> > > > > > > >
> > > > > > > > No, because in most of the cases, there is no need to output
> > the
> > > > > > > > intermediate/partial join records then send retractions.
> > > > > > > >
> > > > > > > >
> > > > > > > > So, how do you think about the condition syntax
> > **L.window_start =
> > > > > > > > R.window_start AND L.window_end = R.window_end** ?
> > > > > > > >
> > > > > > > > Best,
> > > > > > > > Danny Chan
> > > > > > > > 在 2020年9月23日 +0800 PM12:47,dev@calcite.apache.org,写道:
> > > > > > > > >
> > > > > > > > > L.window_start = R.window_start AND L.window_end =
> > R.window_end
> > > > > > > >
> > > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > > Viliam Durina
> > > > > > Jet Developer
> > > > > > hazelcast®
> > > > > >
> > > > > > <https://www.hazelcast.com> 2 W 5th Ave, Ste 300 | San Mateo, CA
> > > > 94402 |
> > > > > > USA
> > > > > > +1 (650) 521-5453 <(650)%20521-5453> <(650)%20521-5453> |
> > hazelcast.com <
> > > > https://www.hazelcast.com>
> > > > > >
> > > > > > --
> > > > > > This message contains confidential information and is intended
> only
> > > > for the
> > > > > > individuals named. If you are not the named addressee you should
> > not
> > > > > > disseminate, distribute or copy this e-mail. Please notify the
> > sender
> > > > > > immediately by e-mail if you have received this e-mail by mistake
> > and
> > > > > > delete this e-mail from your system. E-mail transmission cannot
> be
> > > > > > guaranteed to be secure or error-free as information could be
> > > > intercepted,
> > > > > > corrupted, lost, destroyed, arrive late or incomplete, or contain
> > > > viruses.
> > > > > > The sender therefore does not accept liability for any errors or
> > > > omissions
> > > > > > in the contents of this message, which arise as a result of
> e-mail
> > > > > > transmission. If verification is required, please request a
> > hard-copy
> > > > > > version. -Hazelcast
> > > >
> >
>


-- 
Viliam Durina
Jet Developer
      hazelcast®

  <https://www.hazelcast.com> 2 W 5th Ave, Ste 300 | San Mateo, CA 94402 |
USA
+1 (650) 521-5453 | hazelcast.com <https://www.hazelcast.com>

-- 
This message contains confidential information and is intended only for the 
individuals named. If you are not the named addressee you should not 
disseminate, distribute or copy this e-mail. Please notify the sender 
immediately by e-mail if you have received this e-mail by mistake and 
delete this e-mail from your system. E-mail transmission cannot be 
guaranteed to be secure or error-free as information could be intercepted, 
corrupted, lost, destroyed, arrive late or incomplete, or contain viruses. 
The sender therefore does not accept liability for any errors or omissions 
in the contents of this message, which arise as a result of e-mail 
transmission. If verification is required, please request a hard-copy 
version. -Hazelcast

Re: [ DISCUSS] The window table functions join syntax

Posted by Kenneth Knowles <ke...@apache.org>.
It happens in "raw Beam" somewhat often that there is one windowed
aggregation followed by another. It is pretty unheard of to "assign"
windows twice before an aggregation, because Beam cannot do anything useful
with this: Beam does not have the ability to have two different
fields/columns that are both associated with a watermark. So TUMBLE
followed by TUMBLE is more expressive. You get a number of columns all of
which can be used for windowed aggregation.

One reason this could happen is that you are authoring a table function
that operates generically. It does not know that its input has already been
TUMBLED. In Beam you would reify the prior window information, do your own
TUMBLE. Likely you would do an aggregation and then restore the prior
window information somehow. A compelling example does not come to mind, but
I think considering applying TUMBLE to a generic input helps clarify that
it could happen in code worth allowing.

Kenn

On Wed, Sep 23, 2020 at 11:47 AM Julian Hyde <jh...@apache.org> wrote:

> I don't think it's a problem with table functions in general. And
> besides, we can't change the semantics of table functions. A table
> function must not produce duplicate column names.
>
> The problem is with the semantics of these particular table functions
> - HOP, TUMBLE, SESSION - and what semantics are desirable depends on
> how people will typically use them. Is it common to follow TUMBLE with
> TUMBLE? What would a user expect to be the output columns?
>
> On Wed, Sep 23, 2020 at 11:21 AM Rui Wang <am...@apache.org> wrote:
> >
> > >Is it reasonable to apply TUMBLE to TUMBLE? If so, would people
> > > generally want two sets of window_start, window_end columns?
> >
> > I think it is reasonable to apply TUMBLE to TUMBLE or even TUMBLE to HOP
> > join, as long as there is a real requirement there. The window
> starts/ends
> > are not duplicates. For example TUMBLE as L JOIN HOP as R, L offers a
> > window start and a window end, same for R. This is no different from a
> > normal JOIN case where both JOIN sides have the same column names (but
> they
> > are not considered duplicates). The SQL rule is still applying: within a
> > scope there shouldn't be ambiguous column names (e.g. duplicated column
> > name). For JOIN duplicate names from JOIN inputs are differentiated by
> > table alias.
> >
> >
> > Regarding https://issues.apache.org/jira/browse/CALCITE-4274, this is an
> > interesting case that is different from the JOIN case, and I also think
> > this is a general case (not limited to TUMBLE).
> >
> > Think about that for any query that uses table function of the pattern
> > in CALCITE-4274. The first table function generates column A and then it
> > becomes the input for the second table function, which also wants to
> append
> > a column named "A". How should Calcite handle this case?
> >
> >
> > -Rui
> >
> >
> >
> > On Wed, Sep 23, 2020 at 9:13 AM Julian Hyde <jh...@apache.org> wrote:
> >
> > > I think we should also discuss
> > > https://issues.apache.org/jira/browse/CALCITE-4274 here.
> > >
> > > We've never discussed what should happen if you apply TUMBLE to TUMBLE
> > > (or TUMBLE to HOP, etc.). What happens now is that you get duplicate
> > > columns.
> > >
> > > Is it reasonable to apply TUMBLE to TUMBLE? If so, would people
> > > generally want two sets of window_start, window_end columns?
> > >
> > > Julian
> > >
> > > On Wed, Sep 23, 2020 at 2:41 AM Danny Chan <yu...@gmail.com>
> wrote:
> > > >
> > > > Thanks for the feedback, I agree we should keep the verbose part
> > > >
> > > > **L.window_start = R.window_start AND L.window_end =R.window_end**
> > > >
> > > > Which would make the semantic more clear ~
> > > >
> > > > Best,
> > > > Danny Chan
> > > > 在 2020年9月23日 +0800 PM3:24,Viliam Durina <vi...@hazelcast.com>,写道:
> > > > > You can also use
> > > > >
> > > > > SELECT L.f0, R.f2, L.window_start, L.window_end
> > > > > FROM
> > > > > Tumble(table T1, descriptor(T1.ts), INTERVAL ‘5’ MINUTE) L
> > > > > JOIN
> > > > > Tumble(table T2, descriptor(T2.ts), INTERVAL ‘5’ MINUTE) R
> > > > > USING (f0, window_start)
> > > > >
> > > > > Viliam
> > > > >
> > > > > On Wed, 23 Sep 2020 at 08:02, Rui Wang <am...@apache.org>
> wrote:
> > > > >
> > > > > > Regarding to **L.window_start = R.window_start AND L.window_end =
> > > > > > R.window_end**:
> > > > > >
> > > > > > In general, the current table function windowing model is to
> append
> > > window
> > > > > > metadata to table directly, thus window metadata becomes a part
> of
> > > table
> > > > > > (or call it data). So as a part of table, these two columns
> should be
> > > > > > treated as normal columns thus they should be in the join on
> > > condition.
> > > > > >
> > > > > > If you want to make it optional, it makes window start/end
> columns
> > > special
> > > > > > and has a semantic binding with special table functions (TUMBLE,
> HOP,
> > > > > > SESSION), which then becomes really not a SQL thing. For
> example, we
> > > can
> > > > > > allow users to define their own windowing table function. In that
> > > case, how
> > > > > > will you utilize window start/end produced by a customized
> windowing
> > > table
> > > > > > function? What if users produce wired windows that have
> overlapped
> > > window
> > > > > > starts or window ends?
> > > > > >
> > > > > > Keeping windows start/end as a part of the table, treating them
> no
> > > > > > different from other columns, could give a consistent behavior
> for
> > > either
> > > > > > built-in table function or user-defined table function.
> > > > > >
> > > > > > If you think it is too verbose, there are two options to
> optimize:
> > > > > >
> > > > > > 1. for TUMBLE/HOP/SESSION, to identify a unique window, you will
> > > only need
> > > > > > either window start or end, so you can simplify it, for example,
> to
> > > > > > L.window_start = R.window_start only.
> > > > > > 2. (not recommended), you can cut off **L.window_start =
> > > R.window_start AND
> > > > > > L.window_end = R.window_end**, but add window metadata
> comparison to
> > > join
> > > > > > implicitly by execution engine. E.g. you can make up the join
> > > condition in
> > > > > > your JoinRel if two inputs are TUMBLE.
> > > > > >
> > > > > >
> > > > > >
> > > > > > -Rui
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > On Tue, Sep 22, 2020 at 10:27 PM Danny Chan <
> yuzhao.cyz@gmail.com>
> > > wrote:
> > > > > >
> > > > > > > Yes, the red part is **L.window_start = R.window_start AND
> > > L.window_end =
> > > > > > > R.window_end**
> > > > > > >
> > > > > > > > Is this a limitation for "triggered by the watermark of the
> > > stream”?
> > > > > > >
> > > > > > > No, because in most of the cases, there is no need to output
> the
> > > > > > > intermediate/partial join records then send retractions.
> > > > > > >
> > > > > > >
> > > > > > > So, how do you think about the condition syntax
> **L.window_start =
> > > > > > > R.window_start AND L.window_end = R.window_end** ?
> > > > > > >
> > > > > > > Best,
> > > > > > > Danny Chan
> > > > > > > 在 2020年9月23日 +0800 PM12:47,dev@calcite.apache.org,写道:
> > > > > > > >
> > > > > > > > L.window_start = R.window_start AND L.window_end =
> R.window_end
> > > > > > >
> > > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Viliam Durina
> > > > > Jet Developer
> > > > > hazelcast®
> > > > >
> > > > > <https://www.hazelcast.com> 2 W 5th Ave, Ste 300 | San Mateo, CA
> > > 94402 |
> > > > > USA
> > > > > +1 (650) 521-5453 <(650)%20521-5453> <(650)%20521-5453> |
> hazelcast.com <
> > > https://www.hazelcast.com>
> > > > >
> > > > > --
> > > > > This message contains confidential information and is intended only
> > > for the
> > > > > individuals named. If you are not the named addressee you should
> not
> > > > > disseminate, distribute or copy this e-mail. Please notify the
> sender
> > > > > immediately by e-mail if you have received this e-mail by mistake
> and
> > > > > delete this e-mail from your system. E-mail transmission cannot be
> > > > > guaranteed to be secure or error-free as information could be
> > > intercepted,
> > > > > corrupted, lost, destroyed, arrive late or incomplete, or contain
> > > viruses.
> > > > > The sender therefore does not accept liability for any errors or
> > > omissions
> > > > > in the contents of this message, which arise as a result of e-mail
> > > > > transmission. If verification is required, please request a
> hard-copy
> > > > > version. -Hazelcast
> > >
>

Re: [ DISCUSS] The window table functions join syntax

Posted by Julian Hyde <jh...@apache.org>.
I don't think it's a problem with table functions in general. And
besides, we can't change the semantics of table functions. A table
function must not produce duplicate column names.

The problem is with the semantics of these particular table functions
- HOP, TUMBLE, SESSION - and what semantics are desirable depends on
how people will typically use them. Is it common to follow TUMBLE with
TUMBLE? What would a user expect to be the output columns?

On Wed, Sep 23, 2020 at 11:21 AM Rui Wang <am...@apache.org> wrote:
>
> >Is it reasonable to apply TUMBLE to TUMBLE? If so, would people
> > generally want two sets of window_start, window_end columns?
>
> I think it is reasonable to apply TUMBLE to TUMBLE or even TUMBLE to HOP
> join, as long as there is a real requirement there. The window starts/ends
> are not duplicates. For example TUMBLE as L JOIN HOP as R, L offers a
> window start and a window end, same for R. This is no different from a
> normal JOIN case where both JOIN sides have the same column names (but they
> are not considered duplicates). The SQL rule is still applying: within a
> scope there shouldn't be ambiguous column names (e.g. duplicated column
> name). For JOIN duplicate names from JOIN inputs are differentiated by
> table alias.
>
>
> Regarding https://issues.apache.org/jira/browse/CALCITE-4274, this is an
> interesting case that is different from the JOIN case, and I also think
> this is a general case (not limited to TUMBLE).
>
> Think about that for any query that uses table function of the pattern
> in CALCITE-4274. The first table function generates column A and then it
> becomes the input for the second table function, which also wants to append
> a column named "A". How should Calcite handle this case?
>
>
> -Rui
>
>
>
> On Wed, Sep 23, 2020 at 9:13 AM Julian Hyde <jh...@apache.org> wrote:
>
> > I think we should also discuss
> > https://issues.apache.org/jira/browse/CALCITE-4274 here.
> >
> > We've never discussed what should happen if you apply TUMBLE to TUMBLE
> > (or TUMBLE to HOP, etc.). What happens now is that you get duplicate
> > columns.
> >
> > Is it reasonable to apply TUMBLE to TUMBLE? If so, would people
> > generally want two sets of window_start, window_end columns?
> >
> > Julian
> >
> > On Wed, Sep 23, 2020 at 2:41 AM Danny Chan <yu...@gmail.com> wrote:
> > >
> > > Thanks for the feedback, I agree we should keep the verbose part
> > >
> > > **L.window_start = R.window_start AND L.window_end =R.window_end**
> > >
> > > Which would make the semantic more clear ~
> > >
> > > Best,
> > > Danny Chan
> > > 在 2020年9月23日 +0800 PM3:24,Viliam Durina <vi...@hazelcast.com>,写道:
> > > > You can also use
> > > >
> > > > SELECT L.f0, R.f2, L.window_start, L.window_end
> > > > FROM
> > > > Tumble(table T1, descriptor(T1.ts), INTERVAL ‘5’ MINUTE) L
> > > > JOIN
> > > > Tumble(table T2, descriptor(T2.ts), INTERVAL ‘5’ MINUTE) R
> > > > USING (f0, window_start)
> > > >
> > > > Viliam
> > > >
> > > > On Wed, 23 Sep 2020 at 08:02, Rui Wang <am...@apache.org> wrote:
> > > >
> > > > > Regarding to **L.window_start = R.window_start AND L.window_end =
> > > > > R.window_end**:
> > > > >
> > > > > In general, the current table function windowing model is to append
> > window
> > > > > metadata to table directly, thus window metadata becomes a part of
> > table
> > > > > (or call it data). So as a part of table, these two columns should be
> > > > > treated as normal columns thus they should be in the join on
> > condition.
> > > > >
> > > > > If you want to make it optional, it makes window start/end columns
> > special
> > > > > and has a semantic binding with special table functions (TUMBLE, HOP,
> > > > > SESSION), which then becomes really not a SQL thing. For example, we
> > can
> > > > > allow users to define their own windowing table function. In that
> > case, how
> > > > > will you utilize window start/end produced by a customized windowing
> > table
> > > > > function? What if users produce wired windows that have overlapped
> > window
> > > > > starts or window ends?
> > > > >
> > > > > Keeping windows start/end as a part of the table, treating them no
> > > > > different from other columns, could give a consistent behavior for
> > either
> > > > > built-in table function or user-defined table function.
> > > > >
> > > > > If you think it is too verbose, there are two options to optimize:
> > > > >
> > > > > 1. for TUMBLE/HOP/SESSION, to identify a unique window, you will
> > only need
> > > > > either window start or end, so you can simplify it, for example, to
> > > > > L.window_start = R.window_start only.
> > > > > 2. (not recommended), you can cut off **L.window_start =
> > R.window_start AND
> > > > > L.window_end = R.window_end**, but add window metadata comparison to
> > join
> > > > > implicitly by execution engine. E.g. you can make up the join
> > condition in
> > > > > your JoinRel if two inputs are TUMBLE.
> > > > >
> > > > >
> > > > >
> > > > > -Rui
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > On Tue, Sep 22, 2020 at 10:27 PM Danny Chan <yu...@gmail.com>
> > wrote:
> > > > >
> > > > > > Yes, the red part is **L.window_start = R.window_start AND
> > L.window_end =
> > > > > > R.window_end**
> > > > > >
> > > > > > > Is this a limitation for "triggered by the watermark of the
> > stream”?
> > > > > >
> > > > > > No, because in most of the cases, there is no need to output the
> > > > > > intermediate/partial join records then send retractions.
> > > > > >
> > > > > >
> > > > > > So, how do you think about the condition syntax **L.window_start =
> > > > > > R.window_start AND L.window_end = R.window_end** ?
> > > > > >
> > > > > > Best,
> > > > > > Danny Chan
> > > > > > 在 2020年9月23日 +0800 PM12:47,dev@calcite.apache.org,写道:
> > > > > > >
> > > > > > > L.window_start = R.window_start AND L.window_end = R.window_end
> > > > > >
> > > > >
> > > >
> > > >
> > > > --
> > > > Viliam Durina
> > > > Jet Developer
> > > > hazelcast®
> > > >
> > > > <https://www.hazelcast.com> 2 W 5th Ave, Ste 300 | San Mateo, CA
> > 94402 |
> > > > USA
> > > > +1 (650) 521-5453 <(650)%20521-5453> | hazelcast.com <
> > https://www.hazelcast.com>
> > > >
> > > > --
> > > > This message contains confidential information and is intended only
> > for the
> > > > individuals named. If you are not the named addressee you should not
> > > > disseminate, distribute or copy this e-mail. Please notify the sender
> > > > immediately by e-mail if you have received this e-mail by mistake and
> > > > delete this e-mail from your system. E-mail transmission cannot be
> > > > guaranteed to be secure or error-free as information could be
> > intercepted,
> > > > corrupted, lost, destroyed, arrive late or incomplete, or contain
> > viruses.
> > > > The sender therefore does not accept liability for any errors or
> > omissions
> > > > in the contents of this message, which arise as a result of e-mail
> > > > transmission. If verification is required, please request a hard-copy
> > > > version. -Hazelcast
> >

Re: [ DISCUSS] The window table functions join syntax

Posted by Rui Wang <am...@apache.org>.
>Is it reasonable to apply TUMBLE to TUMBLE? If so, would people
> generally want two sets of window_start, window_end columns?

I think it is reasonable to apply TUMBLE to TUMBLE or even TUMBLE to HOP
join, as long as there is a real requirement there. The window starts/ends
are not duplicates. For example TUMBLE as L JOIN HOP as R, L offers a
window start and a window end, same for R. This is no different from a
normal JOIN case where both JOIN sides have the same column names (but they
are not considered duplicates). The SQL rule is still applying: within a
scope there shouldn't be ambiguous column names (e.g. duplicated column
name). For JOIN duplicate names from JOIN inputs are differentiated by
table alias.


Regarding https://issues.apache.org/jira/browse/CALCITE-4274, this is an
interesting case that is different from the JOIN case, and I also think
this is a general case (not limited to TUMBLE).

Think about that for any query that uses table function of the pattern
in CALCITE-4274. The first table function generates column A and then it
becomes the input for the second table function, which also wants to append
a column named "A". How should Calcite handle this case?


-Rui



On Wed, Sep 23, 2020 at 9:13 AM Julian Hyde <jh...@apache.org> wrote:

> I think we should also discuss
> https://issues.apache.org/jira/browse/CALCITE-4274 here.
>
> We've never discussed what should happen if you apply TUMBLE to TUMBLE
> (or TUMBLE to HOP, etc.). What happens now is that you get duplicate
> columns.
>
> Is it reasonable to apply TUMBLE to TUMBLE? If so, would people
> generally want two sets of window_start, window_end columns?
>
> Julian
>
> On Wed, Sep 23, 2020 at 2:41 AM Danny Chan <yu...@gmail.com> wrote:
> >
> > Thanks for the feedback, I agree we should keep the verbose part
> >
> > **L.window_start = R.window_start AND L.window_end =R.window_end**
> >
> > Which would make the semantic more clear ~
> >
> > Best,
> > Danny Chan
> > 在 2020年9月23日 +0800 PM3:24,Viliam Durina <vi...@hazelcast.com>,写道:
> > > You can also use
> > >
> > > SELECT L.f0, R.f2, L.window_start, L.window_end
> > > FROM
> > > Tumble(table T1, descriptor(T1.ts), INTERVAL ‘5’ MINUTE) L
> > > JOIN
> > > Tumble(table T2, descriptor(T2.ts), INTERVAL ‘5’ MINUTE) R
> > > USING (f0, window_start)
> > >
> > > Viliam
> > >
> > > On Wed, 23 Sep 2020 at 08:02, Rui Wang <am...@apache.org> wrote:
> > >
> > > > Regarding to **L.window_start = R.window_start AND L.window_end =
> > > > R.window_end**:
> > > >
> > > > In general, the current table function windowing model is to append
> window
> > > > metadata to table directly, thus window metadata becomes a part of
> table
> > > > (or call it data). So as a part of table, these two columns should be
> > > > treated as normal columns thus they should be in the join on
> condition.
> > > >
> > > > If you want to make it optional, it makes window start/end columns
> special
> > > > and has a semantic binding with special table functions (TUMBLE, HOP,
> > > > SESSION), which then becomes really not a SQL thing. For example, we
> can
> > > > allow users to define their own windowing table function. In that
> case, how
> > > > will you utilize window start/end produced by a customized windowing
> table
> > > > function? What if users produce wired windows that have overlapped
> window
> > > > starts or window ends?
> > > >
> > > > Keeping windows start/end as a part of the table, treating them no
> > > > different from other columns, could give a consistent behavior for
> either
> > > > built-in table function or user-defined table function.
> > > >
> > > > If you think it is too verbose, there are two options to optimize:
> > > >
> > > > 1. for TUMBLE/HOP/SESSION, to identify a unique window, you will
> only need
> > > > either window start or end, so you can simplify it, for example, to
> > > > L.window_start = R.window_start only.
> > > > 2. (not recommended), you can cut off **L.window_start =
> R.window_start AND
> > > > L.window_end = R.window_end**, but add window metadata comparison to
> join
> > > > implicitly by execution engine. E.g. you can make up the join
> condition in
> > > > your JoinRel if two inputs are TUMBLE.
> > > >
> > > >
> > > >
> > > > -Rui
> > > >
> > > >
> > > >
> > > >
> > > > On Tue, Sep 22, 2020 at 10:27 PM Danny Chan <yu...@gmail.com>
> wrote:
> > > >
> > > > > Yes, the red part is **L.window_start = R.window_start AND
> L.window_end =
> > > > > R.window_end**
> > > > >
> > > > > > Is this a limitation for "triggered by the watermark of the
> stream”?
> > > > >
> > > > > No, because in most of the cases, there is no need to output the
> > > > > intermediate/partial join records then send retractions.
> > > > >
> > > > >
> > > > > So, how do you think about the condition syntax **L.window_start =
> > > > > R.window_start AND L.window_end = R.window_end** ?
> > > > >
> > > > > Best,
> > > > > Danny Chan
> > > > > 在 2020年9月23日 +0800 PM12:47,dev@calcite.apache.org,写道:
> > > > > >
> > > > > > L.window_start = R.window_start AND L.window_end = R.window_end
> > > > >
> > > >
> > >
> > >
> > > --
> > > Viliam Durina
> > > Jet Developer
> > > hazelcast®
> > >
> > > <https://www.hazelcast.com> 2 W 5th Ave, Ste 300 | San Mateo, CA
> 94402 |
> > > USA
> > > +1 (650) 521-5453 <(650)%20521-5453> | hazelcast.com <
> https://www.hazelcast.com>
> > >
> > > --
> > > This message contains confidential information and is intended only
> for the
> > > individuals named. If you are not the named addressee you should not
> > > disseminate, distribute or copy this e-mail. Please notify the sender
> > > immediately by e-mail if you have received this e-mail by mistake and
> > > delete this e-mail from your system. E-mail transmission cannot be
> > > guaranteed to be secure or error-free as information could be
> intercepted,
> > > corrupted, lost, destroyed, arrive late or incomplete, or contain
> viruses.
> > > The sender therefore does not accept liability for any errors or
> omissions
> > > in the contents of this message, which arise as a result of e-mail
> > > transmission. If verification is required, please request a hard-copy
> > > version. -Hazelcast
>

Re: [ DISCUSS] The window table functions join syntax

Posted by Julian Hyde <jh...@apache.org>.
I think we should also discuss
https://issues.apache.org/jira/browse/CALCITE-4274 here.

We've never discussed what should happen if you apply TUMBLE to TUMBLE
(or TUMBLE to HOP, etc.). What happens now is that you get duplicate
columns.

Is it reasonable to apply TUMBLE to TUMBLE? If so, would people
generally want two sets of window_start, window_end columns?

Julian

On Wed, Sep 23, 2020 at 2:41 AM Danny Chan <yu...@gmail.com> wrote:
>
> Thanks for the feedback, I agree we should keep the verbose part
>
> **L.window_start = R.window_start AND L.window_end =R.window_end**
>
> Which would make the semantic more clear ~
>
> Best,
> Danny Chan
> 在 2020年9月23日 +0800 PM3:24,Viliam Durina <vi...@hazelcast.com>,写道:
> > You can also use
> >
> > SELECT L.f0, R.f2, L.window_start, L.window_end
> > FROM
> > Tumble(table T1, descriptor(T1.ts), INTERVAL ‘5’ MINUTE) L
> > JOIN
> > Tumble(table T2, descriptor(T2.ts), INTERVAL ‘5’ MINUTE) R
> > USING (f0, window_start)
> >
> > Viliam
> >
> > On Wed, 23 Sep 2020 at 08:02, Rui Wang <am...@apache.org> wrote:
> >
> > > Regarding to **L.window_start = R.window_start AND L.window_end =
> > > R.window_end**:
> > >
> > > In general, the current table function windowing model is to append window
> > > metadata to table directly, thus window metadata becomes a part of table
> > > (or call it data). So as a part of table, these two columns should be
> > > treated as normal columns thus they should be in the join on condition.
> > >
> > > If you want to make it optional, it makes window start/end columns special
> > > and has a semantic binding with special table functions (TUMBLE, HOP,
> > > SESSION), which then becomes really not a SQL thing. For example, we can
> > > allow users to define their own windowing table function. In that case, how
> > > will you utilize window start/end produced by a customized windowing table
> > > function? What if users produce wired windows that have overlapped window
> > > starts or window ends?
> > >
> > > Keeping windows start/end as a part of the table, treating them no
> > > different from other columns, could give a consistent behavior for either
> > > built-in table function or user-defined table function.
> > >
> > > If you think it is too verbose, there are two options to optimize:
> > >
> > > 1. for TUMBLE/HOP/SESSION, to identify a unique window, you will only need
> > > either window start or end, so you can simplify it, for example, to
> > > L.window_start = R.window_start only.
> > > 2. (not recommended), you can cut off **L.window_start = R.window_start AND
> > > L.window_end = R.window_end**, but add window metadata comparison to join
> > > implicitly by execution engine. E.g. you can make up the join condition in
> > > your JoinRel if two inputs are TUMBLE.
> > >
> > >
> > >
> > > -Rui
> > >
> > >
> > >
> > >
> > > On Tue, Sep 22, 2020 at 10:27 PM Danny Chan <yu...@gmail.com> wrote:
> > >
> > > > Yes, the red part is **L.window_start = R.window_start AND L.window_end =
> > > > R.window_end**
> > > >
> > > > > Is this a limitation for "triggered by the watermark of the stream”?
> > > >
> > > > No, because in most of the cases, there is no need to output the
> > > > intermediate/partial join records then send retractions.
> > > >
> > > >
> > > > So, how do you think about the condition syntax **L.window_start =
> > > > R.window_start AND L.window_end = R.window_end** ?
> > > >
> > > > Best,
> > > > Danny Chan
> > > > 在 2020年9月23日 +0800 PM12:47,dev@calcite.apache.org,写道:
> > > > >
> > > > > L.window_start = R.window_start AND L.window_end = R.window_end
> > > >
> > >
> >
> >
> > --
> > Viliam Durina
> > Jet Developer
> > hazelcast®
> >
> > <https://www.hazelcast.com> 2 W 5th Ave, Ste 300 | San Mateo, CA 94402 |
> > USA
> > +1 (650) 521-5453 | hazelcast.com <https://www.hazelcast.com>
> >
> > --
> > This message contains confidential information and is intended only for the
> > individuals named. If you are not the named addressee you should not
> > disseminate, distribute or copy this e-mail. Please notify the sender
> > immediately by e-mail if you have received this e-mail by mistake and
> > delete this e-mail from your system. E-mail transmission cannot be
> > guaranteed to be secure or error-free as information could be intercepted,
> > corrupted, lost, destroyed, arrive late or incomplete, or contain viruses.
> > The sender therefore does not accept liability for any errors or omissions
> > in the contents of this message, which arise as a result of e-mail
> > transmission. If verification is required, please request a hard-copy
> > version. -Hazelcast

Re: [ DISCUSS] The window table functions join syntax

Posted by Danny Chan <yu...@gmail.com>.
Thanks for the feedback, I agree we should keep the verbose part

**L.window_start = R.window_start AND L.window_end =R.window_end**

Which would make the semantic more clear ~

Best,
Danny Chan
在 2020年9月23日 +0800 PM3:24,Viliam Durina <vi...@hazelcast.com>,写道:
> You can also use
>
> SELECT L.f0, R.f2, L.window_start, L.window_end
> FROM
> Tumble(table T1, descriptor(T1.ts), INTERVAL ‘5’ MINUTE) L
> JOIN
> Tumble(table T2, descriptor(T2.ts), INTERVAL ‘5’ MINUTE) R
> USING (f0, window_start)
>
> Viliam
>
> On Wed, 23 Sep 2020 at 08:02, Rui Wang <am...@apache.org> wrote:
>
> > Regarding to **L.window_start = R.window_start AND L.window_end =
> > R.window_end**:
> >
> > In general, the current table function windowing model is to append window
> > metadata to table directly, thus window metadata becomes a part of table
> > (or call it data). So as a part of table, these two columns should be
> > treated as normal columns thus they should be in the join on condition.
> >
> > If you want to make it optional, it makes window start/end columns special
> > and has a semantic binding with special table functions (TUMBLE, HOP,
> > SESSION), which then becomes really not a SQL thing. For example, we can
> > allow users to define their own windowing table function. In that case, how
> > will you utilize window start/end produced by a customized windowing table
> > function? What if users produce wired windows that have overlapped window
> > starts or window ends?
> >
> > Keeping windows start/end as a part of the table, treating them no
> > different from other columns, could give a consistent behavior for either
> > built-in table function or user-defined table function.
> >
> > If you think it is too verbose, there are two options to optimize:
> >
> > 1. for TUMBLE/HOP/SESSION, to identify a unique window, you will only need
> > either window start or end, so you can simplify it, for example, to
> > L.window_start = R.window_start only.
> > 2. (not recommended), you can cut off **L.window_start = R.window_start AND
> > L.window_end = R.window_end**, but add window metadata comparison to join
> > implicitly by execution engine. E.g. you can make up the join condition in
> > your JoinRel if two inputs are TUMBLE.
> >
> >
> >
> > -Rui
> >
> >
> >
> >
> > On Tue, Sep 22, 2020 at 10:27 PM Danny Chan <yu...@gmail.com> wrote:
> >
> > > Yes, the red part is **L.window_start = R.window_start AND L.window_end =
> > > R.window_end**
> > >
> > > > Is this a limitation for "triggered by the watermark of the stream”?
> > >
> > > No, because in most of the cases, there is no need to output the
> > > intermediate/partial join records then send retractions.
> > >
> > >
> > > So, how do you think about the condition syntax **L.window_start =
> > > R.window_start AND L.window_end = R.window_end** ?
> > >
> > > Best,
> > > Danny Chan
> > > 在 2020年9月23日 +0800 PM12:47,dev@calcite.apache.org,写道:
> > > >
> > > > L.window_start = R.window_start AND L.window_end = R.window_end
> > >
> >
>
>
> --
> Viliam Durina
> Jet Developer
> hazelcast®
>
> <https://www.hazelcast.com> 2 W 5th Ave, Ste 300 | San Mateo, CA 94402 |
> USA
> +1 (650) 521-5453 | hazelcast.com <https://www.hazelcast.com>
>
> --
> This message contains confidential information and is intended only for the
> individuals named. If you are not the named addressee you should not
> disseminate, distribute or copy this e-mail. Please notify the sender
> immediately by e-mail if you have received this e-mail by mistake and
> delete this e-mail from your system. E-mail transmission cannot be
> guaranteed to be secure or error-free as information could be intercepted,
> corrupted, lost, destroyed, arrive late or incomplete, or contain viruses.
> The sender therefore does not accept liability for any errors or omissions
> in the contents of this message, which arise as a result of e-mail
> transmission. If verification is required, please request a hard-copy
> version. -Hazelcast

Re: [ DISCUSS] The window table functions join syntax

Posted by Viliam Durina <vi...@hazelcast.com>.
You can also use

SELECT L.f0, R.f2, L.window_start, L.window_end
FROM
Tumble(table T1, descriptor(T1.ts), INTERVAL ‘5’ MINUTE) L
JOIN
Tumble(table T2, descriptor(T2.ts), INTERVAL ‘5’ MINUTE) R
USING (f0, window_start)

Viliam

On Wed, 23 Sep 2020 at 08:02, Rui Wang <am...@apache.org> wrote:

> Regarding to **L.window_start = R.window_start AND L.window_end =
> R.window_end**:
>
> In general, the current table function windowing model is to append window
> metadata to table directly, thus window metadata becomes a part of table
> (or call it data). So as a part of table, these two columns should be
> treated as normal columns thus they should be in the join on condition.
>
> If you want to make it optional, it makes window start/end columns special
> and has a semantic binding with special table functions (TUMBLE, HOP,
> SESSION), which then becomes really not a SQL thing. For example, we can
> allow users to define their own windowing table function. In that case, how
> will you utilize window start/end produced by a customized windowing table
> function? What if users produce wired windows that have overlapped window
> starts or window ends?
>
> Keeping windows start/end as a part of the table, treating them no
> different from other columns, could give a consistent behavior for either
> built-in table function or user-defined table function.
>
> If you think it is too verbose, there are two options to optimize:
>
> 1. for TUMBLE/HOP/SESSION, to identify a unique window, you will only need
> either window start or end, so you can simplify it, for example, to
> L.window_start = R.window_start only.
> 2. (not recommended), you can cut off **L.window_start = R.window_start AND
> L.window_end = R.window_end**, but add window metadata comparison to join
> implicitly by execution engine. E.g. you can make up the join condition in
> your JoinRel if two inputs are TUMBLE.
>
>
>
> -Rui
>
>
>
>
> On Tue, Sep 22, 2020 at 10:27 PM Danny Chan <yu...@gmail.com> wrote:
>
> > Yes, the red part is **L.window_start = R.window_start AND L.window_end =
> > R.window_end**
> >
> > > Is this a limitation for "triggered by the watermark of the stream”?
> >
> > No, because in most of the cases, there is no need to output the
> > intermediate/partial join records then send retractions.
> >
> >
> > So, how do you think about the condition syntax **L.window_start =
> > R.window_start AND L.window_end = R.window_end** ?
> >
> > Best,
> > Danny Chan
> > 在 2020年9月23日 +0800 PM12:47,dev@calcite.apache.org,写道:
> > >
> > > L.window_start = R.window_start AND L.window_end = R.window_end
> >
>


-- 
Viliam Durina
Jet Developer
      hazelcast®

  <https://www.hazelcast.com> 2 W 5th Ave, Ste 300 | San Mateo, CA 94402 |
USA
+1 (650) 521-5453 | hazelcast.com <https://www.hazelcast.com>

-- 
This message contains confidential information and is intended only for the 
individuals named. If you are not the named addressee you should not 
disseminate, distribute or copy this e-mail. Please notify the sender 
immediately by e-mail if you have received this e-mail by mistake and 
delete this e-mail from your system. E-mail transmission cannot be 
guaranteed to be secure or error-free as information could be intercepted, 
corrupted, lost, destroyed, arrive late or incomplete, or contain viruses. 
The sender therefore does not accept liability for any errors or omissions 
in the contents of this message, which arise as a result of e-mail 
transmission. If verification is required, please request a hard-copy 
version. -Hazelcast

Re: [ DISCUSS] The window table functions join syntax

Posted by Rui Wang <am...@apache.org>.
Regarding to **L.window_start = R.window_start AND L.window_end =
R.window_end**:

In general, the current table function windowing model is to append window
metadata to table directly, thus window metadata becomes a part of table
(or call it data). So as a part of table, these two columns should be
treated as normal columns thus they should be in the join on condition.

If you want to make it optional, it makes window start/end columns special
and has a semantic binding with special table functions (TUMBLE, HOP,
SESSION), which then becomes really not a SQL thing. For example, we can
allow users to define their own windowing table function. In that case, how
will you utilize window start/end produced by a customized windowing table
function? What if users produce wired windows that have overlapped window
starts or window ends?

Keeping windows start/end as a part of the table, treating them no
different from other columns, could give a consistent behavior for either
built-in table function or user-defined table function.

If you think it is too verbose, there are two options to optimize:

1. for TUMBLE/HOP/SESSION, to identify a unique window, you will only need
either window start or end, so you can simplify it, for example, to
L.window_start = R.window_start only.
2. (not recommended), you can cut off **L.window_start = R.window_start AND
L.window_end = R.window_end**, but add window metadata comparison to join
implicitly by execution engine. E.g. you can make up the join condition in
your JoinRel if two inputs are TUMBLE.



-Rui




On Tue, Sep 22, 2020 at 10:27 PM Danny Chan <yu...@gmail.com> wrote:

> Yes, the red part is **L.window_start = R.window_start AND L.window_end =
> R.window_end**
>
> > Is this a limitation for "triggered by the watermark of the stream”?
>
> No, because in most of the cases, there is no need to output the
> intermediate/partial join records then send retractions.
>
>
> So, how do you think about the condition syntax **L.window_start =
> R.window_start AND L.window_end = R.window_end** ?
>
> Best,
> Danny Chan
> 在 2020年9月23日 +0800 PM12:47,dev@calcite.apache.org,写道:
> >
> > L.window_start = R.window_start AND L.window_end = R.window_end
>

Re: [ DISCUSS] The window table functions join syntax

Posted by Danny Chan <yu...@gmail.com>.
Yes, the red part is **L.window_start = R.window_start AND L.window_end = R.window_end**

> Is this a limitation for "triggered by the watermark of the stream”?

No, because in most of the cases, there is no need to output the intermediate/partial join records then send retractions.


So, how do you think about the condition syntax **L.window_start = R.window_start AND L.window_end = R.window_end** ?

Best,
Danny Chan
在 2020年9月23日 +0800 PM12:47,dev@calcite.apache.org,写道:
>
> L.window_start = R.window_start AND L.window_end = R.window_end

Re: [ DISCUSS] The window table functions join syntax

Posted by Rui Wang <am...@apache.org>.
Hi Danny,

seems like the "Red" syntax is missing? I am guessing you were talking
about "L.window_start = R.window_start AND L.window_end = R.window_end"?
Can you use ** to indicate the red part.

Two additional questions:

> The Join action is triggered by the watermark of the stream
So sounds like the you define the default as "trigger when window close"
(i.e. watermark passes end of window)

> The join does not produce retractions of the stream, the mainly
difference with normal two-stream join
Is this a limitation for "triggered by the watermark of the stream"?


-Rui

On Tue, Sep 22, 2020 at 9:42 PM James Starr <ja...@gmail.com> wrote:

> Sorry, but what part is red.  Your markup appears to have been stripped.
>
> James
>
> On Tue, Sep 22, 2020 at 9:05 PM Danny Chan <yu...@gmail.com> wrote:
>
> > Hi, fellows, long time no see on the mailing ~
> >
> > Here I want to have a discussion on the join syntax of our recently
> > introduced window table function ~
> >
> > For example, we can define a tumbling window function of 5 minutes size
> as:
> >
> > Tumble(table T, descriptor(T.ts), INTERVAL ‘5’ MINUTE)
> >
> > The we can select from it, and moreover, I want to support 2 window
> > function join for the streaming query recently.
> >
> > The semantics of the windowed stream join is:
> >
> > • The 2 window inputs should have the same window arguments (except for
> > the table name), e.g. for TUMBLE the size should be equal, for HOP, both
> > the side interval and size should be equal
> > • We first window the input stream then join the both window data set of
> > the same TimeWindow
> > • The Join action is triggered by the watermark of the stream
> > • The join does not produce retractions of the stream, the mainly
> > difference with normal two-stream join
> >
> > And I want to propose a join syntax as:
> >
> > Select L.f0, R.f2, L.window_start, L.window_end
> > FROM
> > Tumble(table T1, descriptor(T1.ts), INTERVAL ‘5’ MINUTE) L
> > JOIN
> > Tumble(table T2, descriptor(T2.ts), INTERVAL ‘5’ MINUTE) R
> > ON
> > L.f0 = R.f0 AND L.window_start = R.window_start AND L.window_end =
> > R.window_end
> >
> > The red syntax part is what I want to discuss, the condition seems too
> > verbose because
> > user need to declare it every time.
> >
> >
> > • Should we make it optional ?
> > • Is there better syntax to describe this window join semantics ?
> >
> >
> > Best,
> > Danny Chan
> >
>

Re: [ DISCUSS] The window table functions join syntax

Posted by James Starr <ja...@gmail.com>.
Sorry, but what part is red.  Your markup appears to have been stripped.

James

On Tue, Sep 22, 2020 at 9:05 PM Danny Chan <yu...@gmail.com> wrote:

> Hi, fellows, long time no see on the mailing ~
>
> Here I want to have a discussion on the join syntax of our recently
> introduced window table function ~
>
> For example, we can define a tumbling window function of 5 minutes size as:
>
> Tumble(table T, descriptor(T.ts), INTERVAL ‘5’ MINUTE)
>
> The we can select from it, and moreover, I want to support 2 window
> function join for the streaming query recently.
>
> The semantics of the windowed stream join is:
>
> • The 2 window inputs should have the same window arguments (except for
> the table name), e.g. for TUMBLE the size should be equal, for HOP, both
> the side interval and size should be equal
> • We first window the input stream then join the both window data set of
> the same TimeWindow
> • The Join action is triggered by the watermark of the stream
> • The join does not produce retractions of the stream, the mainly
> difference with normal two-stream join
>
> And I want to propose a join syntax as:
>
> Select L.f0, R.f2, L.window_start, L.window_end
> FROM
> Tumble(table T1, descriptor(T1.ts), INTERVAL ‘5’ MINUTE) L
> JOIN
> Tumble(table T2, descriptor(T2.ts), INTERVAL ‘5’ MINUTE) R
> ON
> L.f0 = R.f0 AND L.window_start = R.window_start AND L.window_end =
> R.window_end
>
> The red syntax part is what I want to discuss, the condition seems too
> verbose because
> user need to declare it every time.
>
>
> • Should we make it optional ?
> • Is there better syntax to describe this window join semantics ?
>
>
> Best,
> Danny Chan
>