You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@calcite.apache.org by Danny Chan <yu...@gmail.com> on 2020/06/04 13:02:06 UTC

[DISCUSS] Some quesitons about the new introduced window table functions (TUMBLE, HOP and SESSION)

Hi, all, in the last 2 releases, Rui Wang has contributed the window table functions to replace the legacy group window syntax, the idea comes from the SQL-2016 polymorphic table functions. But the current implementation also got some feedbacks(or confusion), this thread tries to have a discussion on theses questions that are undecided.

One is about the window function argument names:

=== option1
TUMBLE(table, DESCRIPTOR(datetime), interval, offset)
HOP(table, DESCRIPTOR(time_column), slide, size)

=== option2
TUMBLE(table, DESCRIPTOR(time_column), size, offset)
HOP(table, DESCRIPTOR(time_column), size, hop_size, offset)

I would prefer for option2 for the reasons pointed out by Viliam in CALCITE-3737.


Another is about how to translate the query

For query, select * from TUMBLE(table, DESCRIPTOR(time_column), interval);

Currently our outputs plan is:

LogicalProject(ORDERID=[$0], ROWTIME=[$1], window_start=[$2], window_end=[$3])
LogicalTableFunctionScan(invocation=[TUMBLE(DESCRIPTOR($1), interal:INTERVAL MINUTE)], rowType=[RecordType(INTEGER ORDERID, TIMESTAMP(0) time_column, TIMESTAMP(0) window_start, TIMESTAMP(0) window_end)])
LogicalProject(ORDERID=[$0], ROWTIME=[$1])
LogicalTableScan(table=[[CATALOG, SALES, SHIPMENTS]])

But this seems not right, because the "table table_name" argument is not a relational expression, in CALCITE-3955, Julian has pointed out that we should translate the "table" argument as a CURSOR (or TABLE by Viliam), but another question is how to translate the referenced name "time_column", to a correlate variable ? which is also confusing because there is no correlation in the plan actually.

Any suggestions are appreciated !


[1] https://issues.apache.org/jira/browse/CALCITE-3955
[2] https://issues.apache.org/jira/browse/CALCITE-3737

Best,
Danny Chan

Re: [DISCUSS] Some quesitons about the new introduced window table functions (TUMBLE, HOP and SESSION)

Posted by Danny Chan <da...@apache.org>.
Thanks for the sharing, after some research we found that many DB engines
has poor support for nested row type,
for example, assumes:
my_type is a UDT with schema as (a: int, b: varchar(20))
table T has schema as (f0: my_type, f1: int, f2: varchar(20))

PostgreSQL:
select f0.a, max(f1) from T group by f0 -- fails because it recognize the
"f0.a" f0 as a table name
select (f0).a, max(f1) from T group by f0 -- succeed

Oracle:
select f0.a, max(f1) from T group by f0 -- fails because group by UDT is
not supported

In Calcite, there is a fact that the nested row optimize rules are poor
supported, such as the project prune, transpose.
The nested row nullability is also a topic that needs to be clarified.
Currently, when a nested type field is nullable, all it's fields are
nullable too,
that is not true for Java POJO: User(id: int, name: string) (e.g. its
fields have un-boxed type)


Pengcheng Liu <pe...@gmail.com> 于2020年10月23日周五 下午1:46写道:

> Hi, Danny,
>    I think there is already an explanation in [1], as the author noted:
> > The return value of Tumble is a relation that includes all columns of
> data as well as
> > additional event time columns wstart and wend. It was considered to place
> the original
> > row in a nested row and wstart and wend in a separate nested row, for
> simplicity of
> > namespacing, but that would limit these extensions to engines supporting
> nested rows.
>
> [1]
>
> https://docs.google.com/document/d/138uA7VTpbF84CFrd--cz3YVe0-AQ9ALnsavaSE2JeE4
>
>
> Danny Chan <da...@apache.org> 于2020年10月22日周四 下午7:40写道:
>
> > Rui Wang, can you help to consult with the author of the window TVF
> syntax
> > ? We want to hear more suggestions.
> >
> > cc @Rui Wang
> >
> > Rui Wang <am...@apache.org> 于2020年10月22日周四 下午12:01写道:
> >
> > > This seems reasonable to merge window metadata as a column (more
> compact
> > > format).
> > >
> > > Maybe Julian can comment whether there was special consideration to
> > define
> > > window_start and window_end as two separate columns?
> > >
> > >
> > > -Rui
> > >
> > > On Wed, Oct 21, 2020 at 8:20 PM Danny Chan <da...@apache.org>
> wrote:
> > >
> > > > Some of our fellows suggest the window table value functions to
> append
> > a
> > > > structure type field:
> > > >
> > > > window: ROW(start: timestamp(3), end: timestamp(3))
> > > >
> > > > For example if
> > > >
> > > > table T has schema: (ts: timestamp(3), b: varchar(20)), then
> > > >
> > > > Tumble(TABLE T, DESCRIPTOR(T.ts), INTERVAL '5' MINUTE)
> > > >
> > > > has return type
> > > >
> > > > Row(ts timestamp(3), b varchar(20), window Row(start timestamp(3),
> end
> > > > timestamp(3)))
> > > >
> > > > Thus in the aggregate key or the join condition, we can compare the
> > > > structure “window” field directly, e.g.
> > > >
> > > > GROUP BY T.window or ON L.window = R.window
> > > >
> > > > What do you think about this ?
> > > >
> > > > Rui Wang <am...@apache.org> 于2020年6月9日周二 上午1:34写道:
> > > >
> > > > > On Mon, Jun 8, 2020 at 12:04 AM Viliam Durina <
> viliam@hazelcast.com>
> > > > > wrote:
> > > > >
> > > > > > > 2. rename the hop_size to emit_frequency. Because that's what
> > this
> > > > > > argument tries to say: e.g. emit in every x time_unit.
> > > > > >
> > > > > > It's not about how often you emit. By "emit" I mean the moment
> when
> > > the
> > > > > > rows are sent out. For example, when working with event time and
> > the
> > > > > input
> > > > > > stalls, you might emit much later. Or you can use it for
> historical
> > > > data
> > > > > in
> > > > > > which case the actual interval will be much shorter. It's the
> time
> > > > > interval
> > > > > > by which the window "hops". Two subsequent windows will be this
> > much
> > > > > apart.
> > > > > >
> > > > > > Regarding the DESCRIPTOR, the PTF is supposed to be implemented
> by
> > up
> > > > to
> > > > > 4
> > > > > > member functions: describe, start, fulfill, finish. The
> `describe`
> > > > > function
> > > > > > is supposed to be called during validation. It can validate the
> > > > > arguments.
> > > > > > In case of TUMBLE, it can check whether the column described by
> the
> > > > > > descriptor is present in the input table and if it is of a
> correct
> > > > type.
> > > > > It
> > > > > > can throw errors which will become validation errors.
> > > > >
> > > > > Yes. Descriptor has a feature to enable a type checking. If
> > considering
> > > > it,
> > > > > functions will be (argument name not decided yet):
> > > > >
> > > > > TUMBLE(
> > > > > data TABLE,
> > > > > time_column DESCRIPTOR(*TIMESTAMP*),
> > > > > window_size INTERVAL DAY TO SECOND,
> > > > > offset INTERVAL DAY TO SECOND DEFAULT INTERVAL '0' SECONDS)
> > > > >
> > > > > HOP(
> > > > > data TABLE,
> > > > > time_column DESCRIPTOR(*TIMESTAMP*),
> > > > > window_size INTERVAL DAY TO SECOND,
> > > > > emit_frequency INTERVAL DAY TO SECOND,
> > > > > offset INTERVAL DAY TO SECOND DEFAULT INTERVAL '0' SECONDS)
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > > It also determines
> > > > > > the actual output table type.
> > > > > >
> > > > > > I think it's worthwhile to read the PTF spec before implementing
> > > this,
> > > > > it's
> > > > > > freely available here:
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://standards.iso.org/ittf/PubliclyAvailableStandards/c069776_ISO_IEC_TR_19075-7_2017.zip
> > > > > >
> > > > > > Viliam
> > > > > >
> > > > > > On Sun, 7 Jun 2020 at 06:23, Rui Wang <am...@apache.org>
> > wrote:
> > > > > >
> > > > > > > Thanks Danny for your summary!
> > > > > > >
> > > > > > > For question one, I prefer the ordering of parameter in option
> > two.
> > > > > > > However, for argument names, I think it might be better to
> > > > > > > 1. name the window size explicitly by "window_size".
> > > > > > > 2. rename the hop_size to emit_frequency. Because that's what
> > this
> > > > > > > argument tries to say: e.g. emit in every x time_unit.
> > > > > > > 3. don't use "table" as the first argument name, because
> "table"
> > > is a
> > > > > > > keyword. So we can replace it by "data"
> > > > > > >
> > > > > > > so it might be
> > > > > > > TUMBLE(data, DESCRIPTOR(time_column), window_size[, offset])
> > > > > > > HOP(data, DESCRIPTOR(time_column), window_size,
> emit_frequency[,
> > > > > offset])
> > > > > > >
> > > > > > > Meanwhile, one thing worths mentioning is, like Viliam always
> > > > discussed
> > > > > > in
> > > > > > > CALCITE-4000[1], we can also finalize the data types for each
> > > > > signature:
> > > > > > > TUMBLE(
> > > > > > > data TABLE,
> > > > > > > time_column DESCRIPTOR,
> > > > > > > window_size INTERVAL DAY TO SECOND,
> > > > > > > offset INTERVAL DAY TO SECOND DEFAULT INTERVAL '0' SECONDS)
> > > > > > >
> > > > > > > HOP(
> > > > > > > data TABLE,
> > > > > > > time_column DESCRIPTOR,
> > > > > > > window_size INTERVAL DAY TO SECOND,
> > > > > > > emit_frequency INTERVAL DAY TO SECOND,
> > > > > > > offset INTERVAL DAY TO SECOND DEFAULT INTERVAL '0' SECONDS)
> > > > > > >
> > > > > > > So those intervals have to be INTERVAL DAY TO SECOND, not
> > INTERVAL
> > > > YEAR
> > > > > > TO
> > > > > > > MONTH, because year to month varies in seconds.
> > > > > > >
> > > > > > >
> > > > > > > Regarding question two, I tried to dig into codebase and I
> think
> > it
> > > > > could
> > > > > > > be a RexCorrelVariable to replace that table_name, so it
> becomes
> > > > > > > table_function(RexCorrelVariable, DESCRIPTOR($1),
> > internal:INTERVAL
> > > > > > MINUTE)
> > > > > > >
> > > > > > > I know RexCorrelVariable might be not designed for this
> purpose,
> > > but
> > > > I
> > > > > > > find RexCorrelVariable is really useful because:
> > > > > > > 1. It can provide an id, to indicate which input of
> > > TableFunctionScan
> > > > > > that
> > > > > > > this table_function is applied on. Note that TableFunctionScan
> > > might
> > > > > not
> > > > > > > have one input.
> > > > > > > 2. It can provide RelDataType, to save the referenced input's
> > > schema.
> > > > > The
> > > > > > > input schema is useful during operator argument type validation
> > > (for
> > > > > > > DESCRIPTOR) and operator return type inference.
> > > > > > >
> > > > > > >
> > > > > > > [1]:
> > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://jira.apache.org/jira/browse/CALCITE-4000?focusedCommentId=17107665&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17107665
> > > > > > >
> > > > > > >
> > > > > > > -Rui
> > > > > > >
> > > > > > > On Thu, Jun 4, 2020 at 6:02 AM Danny Chan <
> yuzhao.cyz@gmail.com>
> > > > > wrote:
> > > > > > >
> > > > > > > > Hi, all, in the last 2 releases, Rui Wang has contributed the
> > > > window
> > > > > > > table
> > > > > > > > functions to replace the legacy group window syntax, the idea
> > > comes
> > > > > > from
> > > > > > > > the SQL-2016 polymorphic table functions. But the current
> > > > > > implementation
> > > > > > > > also got some feedbacks(or confusion), this thread tries to
> > have
> > > a
> > > > > > > > discussion on theses questions that are undecided.
> > > > > > > >
> > > > > > > > One is about the window function argument names:
> > > > > > > >
> > > > > > > > === option1
> > > > > > > > TUMBLE(table, DESCRIPTOR(datetime), interval, offset)
> > > > > > > > HOP(table, DESCRIPTOR(time_column), slide, size)
> > > > > > > >
> > > > > > > > === option2
> > > > > > > > TUMBLE(table, DESCRIPTOR(time_column), size, offset)
> > > > > > > > HOP(table, DESCRIPTOR(time_column), size, hop_size, offset)
> > > > > > > >
> > > > > > > > I would prefer for option2 for the reasons pointed out by
> > Viliam
> > > in
> > > > > > > > CALCITE-3737.
> > > > > > > >
> > > > > > > >
> > > > > > > > Another is about how to translate the query
> > > > > > > >
> > > > > > > > For query, select * from TUMBLE(table,
> DESCRIPTOR(time_column),
> > > > > > > interval);
> > > > > > > >
> > > > > > > > Currently our outputs plan is:
> > > > > > > >
> > > > > > > > LogicalProject(ORDERID=[$0], ROWTIME=[$1], window_start=[$2],
> > > > > > > > window_end=[$3])
> > > > > > > > LogicalTableFunctionScan(invocation=[TUMBLE(DESCRIPTOR($1),
> > > > > > > > interal:INTERVAL MINUTE)], rowType=[RecordType(INTEGER
> ORDERID,
> > > > > > > > TIMESTAMP(0) time_column, TIMESTAMP(0) window_start,
> > TIMESTAMP(0)
> > > > > > > > window_end)])
> > > > > > > > LogicalProject(ORDERID=[$0], ROWTIME=[$1])
> > > > > > > > LogicalTableScan(table=[[CATALOG, SALES, SHIPMENTS]])
> > > > > > > >
> > > > > > > > But this seems not right, because the "table table_name"
> > argument
> > > > is
> > > > > > not
> > > > > > > a
> > > > > > > > relational expression, in CALCITE-3955, Julian has pointed
> out
> > > that
> > > > > we
> > > > > > > > should translate the "table" argument as a CURSOR (or TABLE
> by
> > > > > Viliam),
> > > > > > > but
> > > > > > > > another question is how to translate the referenced name
> > > > > "time_column",
> > > > > > > to
> > > > > > > > a correlate variable ? which is also confusing because there
> is
> > > no
> > > > > > > > correlation in the plan actually.
> > > > > > > >
> > > > > > > > Any suggestions are appreciated !
> > > > > > > >
> > > > > > > >
> > > > > > > > [1] https://issues.apache.org/jira/browse/CALCITE-3955
> > > > > > > > [2] https://issues.apache.org/jira/browse/CALCITE-3737
> > > > > > > >
> > > > > > > > Best,
> > > > > > > > Danny Chan
> > > > > > > >
> > > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > > Viliam Durina
> > > > > > Jet Developer
> > > > > >       hazelcast®
> > > > > >
> > > > > >   <https://www.hazelcast.com> 2 W 5th Ave, Ste 300 | San Mateo,
> CA
> > > > > 94402 |
> > > > > > USA
> > > > > > +1 (650) 521-5453 <(650)%20521-5453> <(650)%20521-5453> |
> > > > hazelcast.com <
> > > > > > https://www.hazelcast.com>
> > > > > >
> > > > > > --
> > > > > > This message contains confidential information and is intended
> only
> > > for
> > > > > > the
> > > > > > individuals named. If you are not the named addressee you should
> > not
> > > > > > disseminate, distribute or copy this e-mail. Please notify the
> > sender
> > > > > > immediately by e-mail if you have received this e-mail by mistake
> > and
> > > > > > delete this e-mail from your system. E-mail transmission cannot
> be
> > > > > > guaranteed to be secure or error-free as information could be
> > > > > intercepted,
> > > > > > corrupted, lost, destroyed, arrive late or incomplete, or contain
> > > > > viruses.
> > > > > > The sender therefore does not accept liability for any errors or
> > > > > omissions
> > > > > > in the contents of this message, which arise as a result of
> e-mail
> > > > > > transmission. If verification is required, please request a
> > hard-copy
> > > > > > version. -Hazelcast
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] Some quesitons about the new introduced window table functions (TUMBLE, HOP and SESSION)

Posted by Rui Wang <am...@apache.org>.
Yes. I had offline chats with Kenneth and Julian. I think the primary
reason that window metadata is not as a STRUCT, is just because STRUCT
might not be well supported by engines. Even in Calcite, ROW (STRUCT) might
now be supported well.

Meanwhile, it seems like there is no serious issue for having two
separate columns. For example, if you want to construct PK for windows, you
can use window_start (or the ends)  as PK for most of the windowing cases.
If you still need wstart and wend as a ROW, you could construct it by row
constructor.

-Rui

On Thu, Oct 22, 2020 at 10:46 PM Pengcheng Liu <pe...@gmail.com>
wrote:

> Hi, Danny,
>    I think there is already an explanation in [1], as the author noted:
> > The return value of Tumble is a relation that includes all columns of
> data as well as
> > additional event time columns wstart and wend. It was considered to place
> the original
> > row in a nested row and wstart and wend in a separate nested row, for
> simplicity of
> > namespacing, but that would limit these extensions to engines supporting
> nested rows.
>
> [1]
>
> https://docs.google.com/document/d/138uA7VTpbF84CFrd--cz3YVe0-AQ9ALnsavaSE2JeE4
>
>
> Danny Chan <da...@apache.org> 于2020年10月22日周四 下午7:40写道:
>
> > Rui Wang, can you help to consult with the author of the window TVF
> syntax
> > ? We want to hear more suggestions.
> >
> > cc @Rui Wang
> >
> > Rui Wang <am...@apache.org> 于2020年10月22日周四 下午12:01写道:
> >
> > > This seems reasonable to merge window metadata as a column (more
> compact
> > > format).
> > >
> > > Maybe Julian can comment whether there was special consideration to
> > define
> > > window_start and window_end as two separate columns?
> > >
> > >
> > > -Rui
> > >
> > > On Wed, Oct 21, 2020 at 8:20 PM Danny Chan <da...@apache.org>
> wrote:
> > >
> > > > Some of our fellows suggest the window table value functions to
> append
> > a
> > > > structure type field:
> > > >
> > > > window: ROW(start: timestamp(3), end: timestamp(3))
> > > >
> > > > For example if
> > > >
> > > > table T has schema: (ts: timestamp(3), b: varchar(20)), then
> > > >
> > > > Tumble(TABLE T, DESCRIPTOR(T.ts), INTERVAL '5' MINUTE)
> > > >
> > > > has return type
> > > >
> > > > Row(ts timestamp(3), b varchar(20), window Row(start timestamp(3),
> end
> > > > timestamp(3)))
> > > >
> > > > Thus in the aggregate key or the join condition, we can compare the
> > > > structure “window” field directly, e.g.
> > > >
> > > > GROUP BY T.window or ON L.window = R.window
> > > >
> > > > What do you think about this ?
> > > >
> > > > Rui Wang <am...@apache.org> 于2020年6月9日周二 上午1:34写道:
> > > >
> > > > > On Mon, Jun 8, 2020 at 12:04 AM Viliam Durina <
> viliam@hazelcast.com>
> > > > > wrote:
> > > > >
> > > > > > > 2. rename the hop_size to emit_frequency. Because that's what
> > this
> > > > > > argument tries to say: e.g. emit in every x time_unit.
> > > > > >
> > > > > > It's not about how often you emit. By "emit" I mean the moment
> when
> > > the
> > > > > > rows are sent out. For example, when working with event time and
> > the
> > > > > input
> > > > > > stalls, you might emit much later. Or you can use it for
> historical
> > > > data
> > > > > in
> > > > > > which case the actual interval will be much shorter. It's the
> time
> > > > > interval
> > > > > > by which the window "hops". Two subsequent windows will be this
> > much
> > > > > apart.
> > > > > >
> > > > > > Regarding the DESCRIPTOR, the PTF is supposed to be implemented
> by
> > up
> > > > to
> > > > > 4
> > > > > > member functions: describe, start, fulfill, finish. The
> `describe`
> > > > > function
> > > > > > is supposed to be called during validation. It can validate the
> > > > > arguments.
> > > > > > In case of TUMBLE, it can check whether the column described by
> the
> > > > > > descriptor is present in the input table and if it is of a
> correct
> > > > type.
> > > > > It
> > > > > > can throw errors which will become validation errors.
> > > > >
> > > > > Yes. Descriptor has a feature to enable a type checking. If
> > considering
> > > > it,
> > > > > functions will be (argument name not decided yet):
> > > > >
> > > > > TUMBLE(
> > > > > data TABLE,
> > > > > time_column DESCRIPTOR(*TIMESTAMP*),
> > > > > window_size INTERVAL DAY TO SECOND,
> > > > > offset INTERVAL DAY TO SECOND DEFAULT INTERVAL '0' SECONDS)
> > > > >
> > > > > HOP(
> > > > > data TABLE,
> > > > > time_column DESCRIPTOR(*TIMESTAMP*),
> > > > > window_size INTERVAL DAY TO SECOND,
> > > > > emit_frequency INTERVAL DAY TO SECOND,
> > > > > offset INTERVAL DAY TO SECOND DEFAULT INTERVAL '0' SECONDS)
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > > It also determines
> > > > > > the actual output table type.
> > > > > >
> > > > > > I think it's worthwhile to read the PTF spec before implementing
> > > this,
> > > > > it's
> > > > > > freely available here:
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://standards.iso.org/ittf/PubliclyAvailableStandards/c069776_ISO_IEC_TR_19075-7_2017.zip
> > > > > >
> > > > > > Viliam
> > > > > >
> > > > > > On Sun, 7 Jun 2020 at 06:23, Rui Wang <am...@apache.org>
> > wrote:
> > > > > >
> > > > > > > Thanks Danny for your summary!
> > > > > > >
> > > > > > > For question one, I prefer the ordering of parameter in option
> > two.
> > > > > > > However, for argument names, I think it might be better to
> > > > > > > 1. name the window size explicitly by "window_size".
> > > > > > > 2. rename the hop_size to emit_frequency. Because that's what
> > this
> > > > > > > argument tries to say: e.g. emit in every x time_unit.
> > > > > > > 3. don't use "table" as the first argument name, because
> "table"
> > > is a
> > > > > > > keyword. So we can replace it by "data"
> > > > > > >
> > > > > > > so it might be
> > > > > > > TUMBLE(data, DESCRIPTOR(time_column), window_size[, offset])
> > > > > > > HOP(data, DESCRIPTOR(time_column), window_size,
> emit_frequency[,
> > > > > offset])
> > > > > > >
> > > > > > > Meanwhile, one thing worths mentioning is, like Viliam always
> > > > discussed
> > > > > > in
> > > > > > > CALCITE-4000[1], we can also finalize the data types for each
> > > > > signature:
> > > > > > > TUMBLE(
> > > > > > > data TABLE,
> > > > > > > time_column DESCRIPTOR,
> > > > > > > window_size INTERVAL DAY TO SECOND,
> > > > > > > offset INTERVAL DAY TO SECOND DEFAULT INTERVAL '0' SECONDS)
> > > > > > >
> > > > > > > HOP(
> > > > > > > data TABLE,
> > > > > > > time_column DESCRIPTOR,
> > > > > > > window_size INTERVAL DAY TO SECOND,
> > > > > > > emit_frequency INTERVAL DAY TO SECOND,
> > > > > > > offset INTERVAL DAY TO SECOND DEFAULT INTERVAL '0' SECONDS)
> > > > > > >
> > > > > > > So those intervals have to be INTERVAL DAY TO SECOND, not
> > INTERVAL
> > > > YEAR
> > > > > > TO
> > > > > > > MONTH, because year to month varies in seconds.
> > > > > > >
> > > > > > >
> > > > > > > Regarding question two, I tried to dig into codebase and I
> think
> > it
> > > > > could
> > > > > > > be a RexCorrelVariable to replace that table_name, so it
> becomes
> > > > > > > table_function(RexCorrelVariable, DESCRIPTOR($1),
> > internal:INTERVAL
> > > > > > MINUTE)
> > > > > > >
> > > > > > > I know RexCorrelVariable might be not designed for this
> purpose,
> > > but
> > > > I
> > > > > > > find RexCorrelVariable is really useful because:
> > > > > > > 1. It can provide an id, to indicate which input of
> > > TableFunctionScan
> > > > > > that
> > > > > > > this table_function is applied on. Note that TableFunctionScan
> > > might
> > > > > not
> > > > > > > have one input.
> > > > > > > 2. It can provide RelDataType, to save the referenced input's
> > > schema.
> > > > > The
> > > > > > > input schema is useful during operator argument type validation
> > > (for
> > > > > > > DESCRIPTOR) and operator return type inference.
> > > > > > >
> > > > > > >
> > > > > > > [1]:
> > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://jira.apache.org/jira/browse/CALCITE-4000?focusedCommentId=17107665&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17107665
> > > > > > >
> > > > > > >
> > > > > > > -Rui
> > > > > > >
> > > > > > > On Thu, Jun 4, 2020 at 6:02 AM Danny Chan <
> yuzhao.cyz@gmail.com>
> > > > > wrote:
> > > > > > >
> > > > > > > > Hi, all, in the last 2 releases, Rui Wang has contributed the
> > > > window
> > > > > > > table
> > > > > > > > functions to replace the legacy group window syntax, the idea
> > > comes
> > > > > > from
> > > > > > > > the SQL-2016 polymorphic table functions. But the current
> > > > > > implementation
> > > > > > > > also got some feedbacks(or confusion), this thread tries to
> > have
> > > a
> > > > > > > > discussion on theses questions that are undecided.
> > > > > > > >
> > > > > > > > One is about the window function argument names:
> > > > > > > >
> > > > > > > > === option1
> > > > > > > > TUMBLE(table, DESCRIPTOR(datetime), interval, offset)
> > > > > > > > HOP(table, DESCRIPTOR(time_column), slide, size)
> > > > > > > >
> > > > > > > > === option2
> > > > > > > > TUMBLE(table, DESCRIPTOR(time_column), size, offset)
> > > > > > > > HOP(table, DESCRIPTOR(time_column), size, hop_size, offset)
> > > > > > > >
> > > > > > > > I would prefer for option2 for the reasons pointed out by
> > Viliam
> > > in
> > > > > > > > CALCITE-3737.
> > > > > > > >
> > > > > > > >
> > > > > > > > Another is about how to translate the query
> > > > > > > >
> > > > > > > > For query, select * from TUMBLE(table,
> DESCRIPTOR(time_column),
> > > > > > > interval);
> > > > > > > >
> > > > > > > > Currently our outputs plan is:
> > > > > > > >
> > > > > > > > LogicalProject(ORDERID=[$0], ROWTIME=[$1], window_start=[$2],
> > > > > > > > window_end=[$3])
> > > > > > > > LogicalTableFunctionScan(invocation=[TUMBLE(DESCRIPTOR($1),
> > > > > > > > interal:INTERVAL MINUTE)], rowType=[RecordType(INTEGER
> ORDERID,
> > > > > > > > TIMESTAMP(0) time_column, TIMESTAMP(0) window_start,
> > TIMESTAMP(0)
> > > > > > > > window_end)])
> > > > > > > > LogicalProject(ORDERID=[$0], ROWTIME=[$1])
> > > > > > > > LogicalTableScan(table=[[CATALOG, SALES, SHIPMENTS]])
> > > > > > > >
> > > > > > > > But this seems not right, because the "table table_name"
> > argument
> > > > is
> > > > > > not
> > > > > > > a
> > > > > > > > relational expression, in CALCITE-3955, Julian has pointed
> out
> > > that
> > > > > we
> > > > > > > > should translate the "table" argument as a CURSOR (or TABLE
> by
> > > > > Viliam),
> > > > > > > but
> > > > > > > > another question is how to translate the referenced name
> > > > > "time_column",
> > > > > > > to
> > > > > > > > a correlate variable ? which is also confusing because there
> is
> > > no
> > > > > > > > correlation in the plan actually.
> > > > > > > >
> > > > > > > > Any suggestions are appreciated !
> > > > > > > >
> > > > > > > >
> > > > > > > > [1] https://issues.apache.org/jira/browse/CALCITE-3955
> > > > > > > > [2] https://issues.apache.org/jira/browse/CALCITE-3737
> > > > > > > >
> > > > > > > > Best,
> > > > > > > > Danny Chan
> > > > > > > >
> > > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > > Viliam Durina
> > > > > > Jet Developer
> > > > > >       hazelcast®
> > > > > >
> > > > > >   <https://www.hazelcast.com> 2 W 5th Ave, Ste 300 | San Mateo,
> CA
> > > > > 94402 |
> > > > > > USA
> > > > > > +1 (650) 521-5453 <(650)%20521-5453> <(650)%20521-5453>
> <(650)%20521-5453> |
> > > > hazelcast.com <
> > > > > > https://www.hazelcast.com>
> > > > > >
> > > > > > --
> > > > > > This message contains confidential information and is intended
> only
> > > for
> > > > > > the
> > > > > > individuals named. If you are not the named addressee you should
> > not
> > > > > > disseminate, distribute or copy this e-mail. Please notify the
> > sender
> > > > > > immediately by e-mail if you have received this e-mail by mistake
> > and
> > > > > > delete this e-mail from your system. E-mail transmission cannot
> be
> > > > > > guaranteed to be secure or error-free as information could be
> > > > > intercepted,
> > > > > > corrupted, lost, destroyed, arrive late or incomplete, or contain
> > > > > viruses.
> > > > > > The sender therefore does not accept liability for any errors or
> > > > > omissions
> > > > > > in the contents of this message, which arise as a result of
> e-mail
> > > > > > transmission. If verification is required, please request a
> > hard-copy
> > > > > > version. -Hazelcast
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] Some quesitons about the new introduced window table functions (TUMBLE, HOP and SESSION)

Posted by Pengcheng Liu <pe...@gmail.com>.
Hi, Danny,
   I think there is already an explanation in [1], as the author noted:
> The return value of Tumble is a relation that includes all columns of
data as well as
> additional event time columns wstart and wend. It was considered to place
the original
> row in a nested row and wstart and wend in a separate nested row, for
simplicity of
> namespacing, but that would limit these extensions to engines supporting
nested rows.

[1]
https://docs.google.com/document/d/138uA7VTpbF84CFrd--cz3YVe0-AQ9ALnsavaSE2JeE4


Danny Chan <da...@apache.org> 于2020年10月22日周四 下午7:40写道:

> Rui Wang, can you help to consult with the author of the window TVF syntax
> ? We want to hear more suggestions.
>
> cc @Rui Wang
>
> Rui Wang <am...@apache.org> 于2020年10月22日周四 下午12:01写道:
>
> > This seems reasonable to merge window metadata as a column (more compact
> > format).
> >
> > Maybe Julian can comment whether there was special consideration to
> define
> > window_start and window_end as two separate columns?
> >
> >
> > -Rui
> >
> > On Wed, Oct 21, 2020 at 8:20 PM Danny Chan <da...@apache.org> wrote:
> >
> > > Some of our fellows suggest the window table value functions to append
> a
> > > structure type field:
> > >
> > > window: ROW(start: timestamp(3), end: timestamp(3))
> > >
> > > For example if
> > >
> > > table T has schema: (ts: timestamp(3), b: varchar(20)), then
> > >
> > > Tumble(TABLE T, DESCRIPTOR(T.ts), INTERVAL '5' MINUTE)
> > >
> > > has return type
> > >
> > > Row(ts timestamp(3), b varchar(20), window Row(start timestamp(3), end
> > > timestamp(3)))
> > >
> > > Thus in the aggregate key or the join condition, we can compare the
> > > structure “window” field directly, e.g.
> > >
> > > GROUP BY T.window or ON L.window = R.window
> > >
> > > What do you think about this ?
> > >
> > > Rui Wang <am...@apache.org> 于2020年6月9日周二 上午1:34写道:
> > >
> > > > On Mon, Jun 8, 2020 at 12:04 AM Viliam Durina <vi...@hazelcast.com>
> > > > wrote:
> > > >
> > > > > > 2. rename the hop_size to emit_frequency. Because that's what
> this
> > > > > argument tries to say: e.g. emit in every x time_unit.
> > > > >
> > > > > It's not about how often you emit. By "emit" I mean the moment when
> > the
> > > > > rows are sent out. For example, when working with event time and
> the
> > > > input
> > > > > stalls, you might emit much later. Or you can use it for historical
> > > data
> > > > in
> > > > > which case the actual interval will be much shorter. It's the time
> > > > interval
> > > > > by which the window "hops". Two subsequent windows will be this
> much
> > > > apart.
> > > > >
> > > > > Regarding the DESCRIPTOR, the PTF is supposed to be implemented by
> up
> > > to
> > > > 4
> > > > > member functions: describe, start, fulfill, finish. The `describe`
> > > > function
> > > > > is supposed to be called during validation. It can validate the
> > > > arguments.
> > > > > In case of TUMBLE, it can check whether the column described by the
> > > > > descriptor is present in the input table and if it is of a correct
> > > type.
> > > > It
> > > > > can throw errors which will become validation errors.
> > > >
> > > > Yes. Descriptor has a feature to enable a type checking. If
> considering
> > > it,
> > > > functions will be (argument name not decided yet):
> > > >
> > > > TUMBLE(
> > > > data TABLE,
> > > > time_column DESCRIPTOR(*TIMESTAMP*),
> > > > window_size INTERVAL DAY TO SECOND,
> > > > offset INTERVAL DAY TO SECOND DEFAULT INTERVAL '0' SECONDS)
> > > >
> > > > HOP(
> > > > data TABLE,
> > > > time_column DESCRIPTOR(*TIMESTAMP*),
> > > > window_size INTERVAL DAY TO SECOND,
> > > > emit_frequency INTERVAL DAY TO SECOND,
> > > > offset INTERVAL DAY TO SECOND DEFAULT INTERVAL '0' SECONDS)
> > > >
> > > >
> > > >
> > > >
> > > > > It also determines
> > > > > the actual output table type.
> > > > >
> > > > > I think it's worthwhile to read the PTF spec before implementing
> > this,
> > > > it's
> > > > > freely available here:
> > > > >
> > > > >
> > > >
> > >
> >
> https://standards.iso.org/ittf/PubliclyAvailableStandards/c069776_ISO_IEC_TR_19075-7_2017.zip
> > > > >
> > > > > Viliam
> > > > >
> > > > > On Sun, 7 Jun 2020 at 06:23, Rui Wang <am...@apache.org>
> wrote:
> > > > >
> > > > > > Thanks Danny for your summary!
> > > > > >
> > > > > > For question one, I prefer the ordering of parameter in option
> two.
> > > > > > However, for argument names, I think it might be better to
> > > > > > 1. name the window size explicitly by "window_size".
> > > > > > 2. rename the hop_size to emit_frequency. Because that's what
> this
> > > > > > argument tries to say: e.g. emit in every x time_unit.
> > > > > > 3. don't use "table" as the first argument name, because "table"
> > is a
> > > > > > keyword. So we can replace it by "data"
> > > > > >
> > > > > > so it might be
> > > > > > TUMBLE(data, DESCRIPTOR(time_column), window_size[, offset])
> > > > > > HOP(data, DESCRIPTOR(time_column), window_size, emit_frequency[,
> > > > offset])
> > > > > >
> > > > > > Meanwhile, one thing worths mentioning is, like Viliam always
> > > discussed
> > > > > in
> > > > > > CALCITE-4000[1], we can also finalize the data types for each
> > > > signature:
> > > > > > TUMBLE(
> > > > > > data TABLE,
> > > > > > time_column DESCRIPTOR,
> > > > > > window_size INTERVAL DAY TO SECOND,
> > > > > > offset INTERVAL DAY TO SECOND DEFAULT INTERVAL '0' SECONDS)
> > > > > >
> > > > > > HOP(
> > > > > > data TABLE,
> > > > > > time_column DESCRIPTOR,
> > > > > > window_size INTERVAL DAY TO SECOND,
> > > > > > emit_frequency INTERVAL DAY TO SECOND,
> > > > > > offset INTERVAL DAY TO SECOND DEFAULT INTERVAL '0' SECONDS)
> > > > > >
> > > > > > So those intervals have to be INTERVAL DAY TO SECOND, not
> INTERVAL
> > > YEAR
> > > > > TO
> > > > > > MONTH, because year to month varies in seconds.
> > > > > >
> > > > > >
> > > > > > Regarding question two, I tried to dig into codebase and I think
> it
> > > > could
> > > > > > be a RexCorrelVariable to replace that table_name, so it becomes
> > > > > > table_function(RexCorrelVariable, DESCRIPTOR($1),
> internal:INTERVAL
> > > > > MINUTE)
> > > > > >
> > > > > > I know RexCorrelVariable might be not designed for this purpose,
> > but
> > > I
> > > > > > find RexCorrelVariable is really useful because:
> > > > > > 1. It can provide an id, to indicate which input of
> > TableFunctionScan
> > > > > that
> > > > > > this table_function is applied on. Note that TableFunctionScan
> > might
> > > > not
> > > > > > have one input.
> > > > > > 2. It can provide RelDataType, to save the referenced input's
> > schema.
> > > > The
> > > > > > input schema is useful during operator argument type validation
> > (for
> > > > > > DESCRIPTOR) and operator return type inference.
> > > > > >
> > > > > >
> > > > > > [1]:
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://jira.apache.org/jira/browse/CALCITE-4000?focusedCommentId=17107665&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17107665
> > > > > >
> > > > > >
> > > > > > -Rui
> > > > > >
> > > > > > On Thu, Jun 4, 2020 at 6:02 AM Danny Chan <yu...@gmail.com>
> > > > wrote:
> > > > > >
> > > > > > > Hi, all, in the last 2 releases, Rui Wang has contributed the
> > > window
> > > > > > table
> > > > > > > functions to replace the legacy group window syntax, the idea
> > comes
> > > > > from
> > > > > > > the SQL-2016 polymorphic table functions. But the current
> > > > > implementation
> > > > > > > also got some feedbacks(or confusion), this thread tries to
> have
> > a
> > > > > > > discussion on theses questions that are undecided.
> > > > > > >
> > > > > > > One is about the window function argument names:
> > > > > > >
> > > > > > > === option1
> > > > > > > TUMBLE(table, DESCRIPTOR(datetime), interval, offset)
> > > > > > > HOP(table, DESCRIPTOR(time_column), slide, size)
> > > > > > >
> > > > > > > === option2
> > > > > > > TUMBLE(table, DESCRIPTOR(time_column), size, offset)
> > > > > > > HOP(table, DESCRIPTOR(time_column), size, hop_size, offset)
> > > > > > >
> > > > > > > I would prefer for option2 for the reasons pointed out by
> Viliam
> > in
> > > > > > > CALCITE-3737.
> > > > > > >
> > > > > > >
> > > > > > > Another is about how to translate the query
> > > > > > >
> > > > > > > For query, select * from TUMBLE(table, DESCRIPTOR(time_column),
> > > > > > interval);
> > > > > > >
> > > > > > > Currently our outputs plan is:
> > > > > > >
> > > > > > > LogicalProject(ORDERID=[$0], ROWTIME=[$1], window_start=[$2],
> > > > > > > window_end=[$3])
> > > > > > > LogicalTableFunctionScan(invocation=[TUMBLE(DESCRIPTOR($1),
> > > > > > > interal:INTERVAL MINUTE)], rowType=[RecordType(INTEGER ORDERID,
> > > > > > > TIMESTAMP(0) time_column, TIMESTAMP(0) window_start,
> TIMESTAMP(0)
> > > > > > > window_end)])
> > > > > > > LogicalProject(ORDERID=[$0], ROWTIME=[$1])
> > > > > > > LogicalTableScan(table=[[CATALOG, SALES, SHIPMENTS]])
> > > > > > >
> > > > > > > But this seems not right, because the "table table_name"
> argument
> > > is
> > > > > not
> > > > > > a
> > > > > > > relational expression, in CALCITE-3955, Julian has pointed out
> > that
> > > > we
> > > > > > > should translate the "table" argument as a CURSOR (or TABLE by
> > > > Viliam),
> > > > > > but
> > > > > > > another question is how to translate the referenced name
> > > > "time_column",
> > > > > > to
> > > > > > > a correlate variable ? which is also confusing because there is
> > no
> > > > > > > correlation in the plan actually.
> > > > > > >
> > > > > > > Any suggestions are appreciated !
> > > > > > >
> > > > > > >
> > > > > > > [1] https://issues.apache.org/jira/browse/CALCITE-3955
> > > > > > > [2] https://issues.apache.org/jira/browse/CALCITE-3737
> > > > > > >
> > > > > > > Best,
> > > > > > > Danny Chan
> > > > > > >
> > > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Viliam Durina
> > > > > Jet Developer
> > > > >       hazelcast®
> > > > >
> > > > >   <https://www.hazelcast.com> 2 W 5th Ave, Ste 300 | San Mateo, CA
> > > > 94402 |
> > > > > USA
> > > > > +1 (650) 521-5453 <(650)%20521-5453> <(650)%20521-5453> |
> > > hazelcast.com <
> > > > > https://www.hazelcast.com>
> > > > >
> > > > > --
> > > > > This message contains confidential information and is intended only
> > for
> > > > > the
> > > > > individuals named. If you are not the named addressee you should
> not
> > > > > disseminate, distribute or copy this e-mail. Please notify the
> sender
> > > > > immediately by e-mail if you have received this e-mail by mistake
> and
> > > > > delete this e-mail from your system. E-mail transmission cannot be
> > > > > guaranteed to be secure or error-free as information could be
> > > > intercepted,
> > > > > corrupted, lost, destroyed, arrive late or incomplete, or contain
> > > > viruses.
> > > > > The sender therefore does not accept liability for any errors or
> > > > omissions
> > > > > in the contents of this message, which arise as a result of e-mail
> > > > > transmission. If verification is required, please request a
> hard-copy
> > > > > version. -Hazelcast
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] Some quesitons about the new introduced window table functions (TUMBLE, HOP and SESSION)

Posted by Danny Chan <da...@apache.org>.
Rui Wang, can you help to consult with the author of the window TVF syntax
? We want to hear more suggestions.

cc @Rui Wang

Rui Wang <am...@apache.org> 于2020年10月22日周四 下午12:01写道:

> This seems reasonable to merge window metadata as a column (more compact
> format).
>
> Maybe Julian can comment whether there was special consideration to define
> window_start and window_end as two separate columns?
>
>
> -Rui
>
> On Wed, Oct 21, 2020 at 8:20 PM Danny Chan <da...@apache.org> wrote:
>
> > Some of our fellows suggest the window table value functions to append a
> > structure type field:
> >
> > window: ROW(start: timestamp(3), end: timestamp(3))
> >
> > For example if
> >
> > table T has schema: (ts: timestamp(3), b: varchar(20)), then
> >
> > Tumble(TABLE T, DESCRIPTOR(T.ts), INTERVAL '5' MINUTE)
> >
> > has return type
> >
> > Row(ts timestamp(3), b varchar(20), window Row(start timestamp(3), end
> > timestamp(3)))
> >
> > Thus in the aggregate key or the join condition, we can compare the
> > structure “window” field directly, e.g.
> >
> > GROUP BY T.window or ON L.window = R.window
> >
> > What do you think about this ?
> >
> > Rui Wang <am...@apache.org> 于2020年6月9日周二 上午1:34写道:
> >
> > > On Mon, Jun 8, 2020 at 12:04 AM Viliam Durina <vi...@hazelcast.com>
> > > wrote:
> > >
> > > > > 2. rename the hop_size to emit_frequency. Because that's what this
> > > > argument tries to say: e.g. emit in every x time_unit.
> > > >
> > > > It's not about how often you emit. By "emit" I mean the moment when
> the
> > > > rows are sent out. For example, when working with event time and the
> > > input
> > > > stalls, you might emit much later. Or you can use it for historical
> > data
> > > in
> > > > which case the actual interval will be much shorter. It's the time
> > > interval
> > > > by which the window "hops". Two subsequent windows will be this much
> > > apart.
> > > >
> > > > Regarding the DESCRIPTOR, the PTF is supposed to be implemented by up
> > to
> > > 4
> > > > member functions: describe, start, fulfill, finish. The `describe`
> > > function
> > > > is supposed to be called during validation. It can validate the
> > > arguments.
> > > > In case of TUMBLE, it can check whether the column described by the
> > > > descriptor is present in the input table and if it is of a correct
> > type.
> > > It
> > > > can throw errors which will become validation errors.
> > >
> > > Yes. Descriptor has a feature to enable a type checking. If considering
> > it,
> > > functions will be (argument name not decided yet):
> > >
> > > TUMBLE(
> > > data TABLE,
> > > time_column DESCRIPTOR(*TIMESTAMP*),
> > > window_size INTERVAL DAY TO SECOND,
> > > offset INTERVAL DAY TO SECOND DEFAULT INTERVAL '0' SECONDS)
> > >
> > > HOP(
> > > data TABLE,
> > > time_column DESCRIPTOR(*TIMESTAMP*),
> > > window_size INTERVAL DAY TO SECOND,
> > > emit_frequency INTERVAL DAY TO SECOND,
> > > offset INTERVAL DAY TO SECOND DEFAULT INTERVAL '0' SECONDS)
> > >
> > >
> > >
> > >
> > > > It also determines
> > > > the actual output table type.
> > > >
> > > > I think it's worthwhile to read the PTF spec before implementing
> this,
> > > it's
> > > > freely available here:
> > > >
> > > >
> > >
> >
> https://standards.iso.org/ittf/PubliclyAvailableStandards/c069776_ISO_IEC_TR_19075-7_2017.zip
> > > >
> > > > Viliam
> > > >
> > > > On Sun, 7 Jun 2020 at 06:23, Rui Wang <am...@apache.org> wrote:
> > > >
> > > > > Thanks Danny for your summary!
> > > > >
> > > > > For question one, I prefer the ordering of parameter in option two.
> > > > > However, for argument names, I think it might be better to
> > > > > 1. name the window size explicitly by "window_size".
> > > > > 2. rename the hop_size to emit_frequency. Because that's what this
> > > > > argument tries to say: e.g. emit in every x time_unit.
> > > > > 3. don't use "table" as the first argument name, because "table"
> is a
> > > > > keyword. So we can replace it by "data"
> > > > >
> > > > > so it might be
> > > > > TUMBLE(data, DESCRIPTOR(time_column), window_size[, offset])
> > > > > HOP(data, DESCRIPTOR(time_column), window_size, emit_frequency[,
> > > offset])
> > > > >
> > > > > Meanwhile, one thing worths mentioning is, like Viliam always
> > discussed
> > > > in
> > > > > CALCITE-4000[1], we can also finalize the data types for each
> > > signature:
> > > > > TUMBLE(
> > > > > data TABLE,
> > > > > time_column DESCRIPTOR,
> > > > > window_size INTERVAL DAY TO SECOND,
> > > > > offset INTERVAL DAY TO SECOND DEFAULT INTERVAL '0' SECONDS)
> > > > >
> > > > > HOP(
> > > > > data TABLE,
> > > > > time_column DESCRIPTOR,
> > > > > window_size INTERVAL DAY TO SECOND,
> > > > > emit_frequency INTERVAL DAY TO SECOND,
> > > > > offset INTERVAL DAY TO SECOND DEFAULT INTERVAL '0' SECONDS)
> > > > >
> > > > > So those intervals have to be INTERVAL DAY TO SECOND, not INTERVAL
> > YEAR
> > > > TO
> > > > > MONTH, because year to month varies in seconds.
> > > > >
> > > > >
> > > > > Regarding question two, I tried to dig into codebase and I think it
> > > could
> > > > > be a RexCorrelVariable to replace that table_name, so it becomes
> > > > > table_function(RexCorrelVariable, DESCRIPTOR($1), internal:INTERVAL
> > > > MINUTE)
> > > > >
> > > > > I know RexCorrelVariable might be not designed for this purpose,
> but
> > I
> > > > > find RexCorrelVariable is really useful because:
> > > > > 1. It can provide an id, to indicate which input of
> TableFunctionScan
> > > > that
> > > > > this table_function is applied on. Note that TableFunctionScan
> might
> > > not
> > > > > have one input.
> > > > > 2. It can provide RelDataType, to save the referenced input's
> schema.
> > > The
> > > > > input schema is useful during operator argument type validation
> (for
> > > > > DESCRIPTOR) and operator return type inference.
> > > > >
> > > > >
> > > > > [1]:
> > > > >
> > > > >
> > > >
> > >
> >
> https://jira.apache.org/jira/browse/CALCITE-4000?focusedCommentId=17107665&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17107665
> > > > >
> > > > >
> > > > > -Rui
> > > > >
> > > > > On Thu, Jun 4, 2020 at 6:02 AM Danny Chan <yu...@gmail.com>
> > > wrote:
> > > > >
> > > > > > Hi, all, in the last 2 releases, Rui Wang has contributed the
> > window
> > > > > table
> > > > > > functions to replace the legacy group window syntax, the idea
> comes
> > > > from
> > > > > > the SQL-2016 polymorphic table functions. But the current
> > > > implementation
> > > > > > also got some feedbacks(or confusion), this thread tries to have
> a
> > > > > > discussion on theses questions that are undecided.
> > > > > >
> > > > > > One is about the window function argument names:
> > > > > >
> > > > > > === option1
> > > > > > TUMBLE(table, DESCRIPTOR(datetime), interval, offset)
> > > > > > HOP(table, DESCRIPTOR(time_column), slide, size)
> > > > > >
> > > > > > === option2
> > > > > > TUMBLE(table, DESCRIPTOR(time_column), size, offset)
> > > > > > HOP(table, DESCRIPTOR(time_column), size, hop_size, offset)
> > > > > >
> > > > > > I would prefer for option2 for the reasons pointed out by Viliam
> in
> > > > > > CALCITE-3737.
> > > > > >
> > > > > >
> > > > > > Another is about how to translate the query
> > > > > >
> > > > > > For query, select * from TUMBLE(table, DESCRIPTOR(time_column),
> > > > > interval);
> > > > > >
> > > > > > Currently our outputs plan is:
> > > > > >
> > > > > > LogicalProject(ORDERID=[$0], ROWTIME=[$1], window_start=[$2],
> > > > > > window_end=[$3])
> > > > > > LogicalTableFunctionScan(invocation=[TUMBLE(DESCRIPTOR($1),
> > > > > > interal:INTERVAL MINUTE)], rowType=[RecordType(INTEGER ORDERID,
> > > > > > TIMESTAMP(0) time_column, TIMESTAMP(0) window_start, TIMESTAMP(0)
> > > > > > window_end)])
> > > > > > LogicalProject(ORDERID=[$0], ROWTIME=[$1])
> > > > > > LogicalTableScan(table=[[CATALOG, SALES, SHIPMENTS]])
> > > > > >
> > > > > > But this seems not right, because the "table table_name" argument
> > is
> > > > not
> > > > > a
> > > > > > relational expression, in CALCITE-3955, Julian has pointed out
> that
> > > we
> > > > > > should translate the "table" argument as a CURSOR (or TABLE by
> > > Viliam),
> > > > > but
> > > > > > another question is how to translate the referenced name
> > > "time_column",
> > > > > to
> > > > > > a correlate variable ? which is also confusing because there is
> no
> > > > > > correlation in the plan actually.
> > > > > >
> > > > > > Any suggestions are appreciated !
> > > > > >
> > > > > >
> > > > > > [1] https://issues.apache.org/jira/browse/CALCITE-3955
> > > > > > [2] https://issues.apache.org/jira/browse/CALCITE-3737
> > > > > >
> > > > > > Best,
> > > > > > Danny Chan
> > > > > >
> > > > >
> > > >
> > > >
> > > > --
> > > > Viliam Durina
> > > > Jet Developer
> > > >       hazelcast®
> > > >
> > > >   <https://www.hazelcast.com> 2 W 5th Ave, Ste 300 | San Mateo, CA
> > > 94402 |
> > > > USA
> > > > +1 (650) 521-5453 <(650)%20521-5453> <(650)%20521-5453> |
> > hazelcast.com <
> > > > https://www.hazelcast.com>
> > > >
> > > > --
> > > > This message contains confidential information and is intended only
> for
> > > > the
> > > > individuals named. If you are not the named addressee you should not
> > > > disseminate, distribute or copy this e-mail. Please notify the sender
> > > > immediately by e-mail if you have received this e-mail by mistake and
> > > > delete this e-mail from your system. E-mail transmission cannot be
> > > > guaranteed to be secure or error-free as information could be
> > > intercepted,
> > > > corrupted, lost, destroyed, arrive late or incomplete, or contain
> > > viruses.
> > > > The sender therefore does not accept liability for any errors or
> > > omissions
> > > > in the contents of this message, which arise as a result of e-mail
> > > > transmission. If verification is required, please request a hard-copy
> > > > version. -Hazelcast
> > > >
> > >
> >
>

Re: [DISCUSS] Some quesitons about the new introduced window table functions (TUMBLE, HOP and SESSION)

Posted by Jark Wu <im...@gmail.com>.
Hi Julian,

What do you think about the idea above that merging window start and window
end as a composite column?

cc @Julian

On Thu, 22 Oct 2020 at 12:01, Rui Wang <am...@apache.org> wrote:

> This seems reasonable to merge window metadata as a column (more compact
> format).
>
> Maybe Julian can comment whether there was special consideration to define
> window_start and window_end as two separate columns?
>
>
> -Rui
>
> On Wed, Oct 21, 2020 at 8:20 PM Danny Chan <da...@apache.org> wrote:
>
> > Some of our fellows suggest the window table value functions to append a
> > structure type field:
> >
> > window: ROW(start: timestamp(3), end: timestamp(3))
> >
> > For example if
> >
> > table T has schema: (ts: timestamp(3), b: varchar(20)), then
> >
> > Tumble(TABLE T, DESCRIPTOR(T.ts), INTERVAL '5' MINUTE)
> >
> > has return type
> >
> > Row(ts timestamp(3), b varchar(20), window Row(start timestamp(3), end
> > timestamp(3)))
> >
> > Thus in the aggregate key or the join condition, we can compare the
> > structure “window” field directly, e.g.
> >
> > GROUP BY T.window or ON L.window = R.window
> >
> > What do you think about this ?
> >
> > Rui Wang <am...@apache.org> 于2020年6月9日周二 上午1:34写道:
> >
> > > On Mon, Jun 8, 2020 at 12:04 AM Viliam Durina <vi...@hazelcast.com>
> > > wrote:
> > >
> > > > > 2. rename the hop_size to emit_frequency. Because that's what this
> > > > argument tries to say: e.g. emit in every x time_unit.
> > > >
> > > > It's not about how often you emit. By "emit" I mean the moment when
> the
> > > > rows are sent out. For example, when working with event time and the
> > > input
> > > > stalls, you might emit much later. Or you can use it for historical
> > data
> > > in
> > > > which case the actual interval will be much shorter. It's the time
> > > interval
> > > > by which the window "hops". Two subsequent windows will be this much
> > > apart.
> > > >
> > > > Regarding the DESCRIPTOR, the PTF is supposed to be implemented by up
> > to
> > > 4
> > > > member functions: describe, start, fulfill, finish. The `describe`
> > > function
> > > > is supposed to be called during validation. It can validate the
> > > arguments.
> > > > In case of TUMBLE, it can check whether the column described by the
> > > > descriptor is present in the input table and if it is of a correct
> > type.
> > > It
> > > > can throw errors which will become validation errors.
> > >
> > > Yes. Descriptor has a feature to enable a type checking. If considering
> > it,
> > > functions will be (argument name not decided yet):
> > >
> > > TUMBLE(
> > > data TABLE,
> > > time_column DESCRIPTOR(*TIMESTAMP*),
> > > window_size INTERVAL DAY TO SECOND,
> > > offset INTERVAL DAY TO SECOND DEFAULT INTERVAL '0' SECONDS)
> > >
> > > HOP(
> > > data TABLE,
> > > time_column DESCRIPTOR(*TIMESTAMP*),
> > > window_size INTERVAL DAY TO SECOND,
> > > emit_frequency INTERVAL DAY TO SECOND,
> > > offset INTERVAL DAY TO SECOND DEFAULT INTERVAL '0' SECONDS)
> > >
> > >
> > >
> > >
> > > > It also determines
> > > > the actual output table type.
> > > >
> > > > I think it's worthwhile to read the PTF spec before implementing
> this,
> > > it's
> > > > freely available here:
> > > >
> > > >
> > >
> >
> https://standards.iso.org/ittf/PubliclyAvailableStandards/c069776_ISO_IEC_TR_19075-7_2017.zip
> > > >
> > > > Viliam
> > > >
> > > > On Sun, 7 Jun 2020 at 06:23, Rui Wang <am...@apache.org> wrote:
> > > >
> > > > > Thanks Danny for your summary!
> > > > >
> > > > > For question one, I prefer the ordering of parameter in option two.
> > > > > However, for argument names, I think it might be better to
> > > > > 1. name the window size explicitly by "window_size".
> > > > > 2. rename the hop_size to emit_frequency. Because that's what this
> > > > > argument tries to say: e.g. emit in every x time_unit.
> > > > > 3. don't use "table" as the first argument name, because "table"
> is a
> > > > > keyword. So we can replace it by "data"
> > > > >
> > > > > so it might be
> > > > > TUMBLE(data, DESCRIPTOR(time_column), window_size[, offset])
> > > > > HOP(data, DESCRIPTOR(time_column), window_size, emit_frequency[,
> > > offset])
> > > > >
> > > > > Meanwhile, one thing worths mentioning is, like Viliam always
> > discussed
> > > > in
> > > > > CALCITE-4000[1], we can also finalize the data types for each
> > > signature:
> > > > > TUMBLE(
> > > > > data TABLE,
> > > > > time_column DESCRIPTOR,
> > > > > window_size INTERVAL DAY TO SECOND,
> > > > > offset INTERVAL DAY TO SECOND DEFAULT INTERVAL '0' SECONDS)
> > > > >
> > > > > HOP(
> > > > > data TABLE,
> > > > > time_column DESCRIPTOR,
> > > > > window_size INTERVAL DAY TO SECOND,
> > > > > emit_frequency INTERVAL DAY TO SECOND,
> > > > > offset INTERVAL DAY TO SECOND DEFAULT INTERVAL '0' SECONDS)
> > > > >
> > > > > So those intervals have to be INTERVAL DAY TO SECOND, not INTERVAL
> > YEAR
> > > > TO
> > > > > MONTH, because year to month varies in seconds.
> > > > >
> > > > >
> > > > > Regarding question two, I tried to dig into codebase and I think it
> > > could
> > > > > be a RexCorrelVariable to replace that table_name, so it becomes
> > > > > table_function(RexCorrelVariable, DESCRIPTOR($1), internal:INTERVAL
> > > > MINUTE)
> > > > >
> > > > > I know RexCorrelVariable might be not designed for this purpose,
> but
> > I
> > > > > find RexCorrelVariable is really useful because:
> > > > > 1. It can provide an id, to indicate which input of
> TableFunctionScan
> > > > that
> > > > > this table_function is applied on. Note that TableFunctionScan
> might
> > > not
> > > > > have one input.
> > > > > 2. It can provide RelDataType, to save the referenced input's
> schema.
> > > The
> > > > > input schema is useful during operator argument type validation
> (for
> > > > > DESCRIPTOR) and operator return type inference.
> > > > >
> > > > >
> > > > > [1]:
> > > > >
> > > > >
> > > >
> > >
> >
> https://jira.apache.org/jira/browse/CALCITE-4000?focusedCommentId=17107665&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17107665
> > > > >
> > > > >
> > > > > -Rui
> > > > >
> > > > > On Thu, Jun 4, 2020 at 6:02 AM Danny Chan <yu...@gmail.com>
> > > wrote:
> > > > >
> > > > > > Hi, all, in the last 2 releases, Rui Wang has contributed the
> > window
> > > > > table
> > > > > > functions to replace the legacy group window syntax, the idea
> comes
> > > > from
> > > > > > the SQL-2016 polymorphic table functions. But the current
> > > > implementation
> > > > > > also got some feedbacks(or confusion), this thread tries to have
> a
> > > > > > discussion on theses questions that are undecided.
> > > > > >
> > > > > > One is about the window function argument names:
> > > > > >
> > > > > > === option1
> > > > > > TUMBLE(table, DESCRIPTOR(datetime), interval, offset)
> > > > > > HOP(table, DESCRIPTOR(time_column), slide, size)
> > > > > >
> > > > > > === option2
> > > > > > TUMBLE(table, DESCRIPTOR(time_column), size, offset)
> > > > > > HOP(table, DESCRIPTOR(time_column), size, hop_size, offset)
> > > > > >
> > > > > > I would prefer for option2 for the reasons pointed out by Viliam
> in
> > > > > > CALCITE-3737.
> > > > > >
> > > > > >
> > > > > > Another is about how to translate the query
> > > > > >
> > > > > > For query, select * from TUMBLE(table, DESCRIPTOR(time_column),
> > > > > interval);
> > > > > >
> > > > > > Currently our outputs plan is:
> > > > > >
> > > > > > LogicalProject(ORDERID=[$0], ROWTIME=[$1], window_start=[$2],
> > > > > > window_end=[$3])
> > > > > > LogicalTableFunctionScan(invocation=[TUMBLE(DESCRIPTOR($1),
> > > > > > interal:INTERVAL MINUTE)], rowType=[RecordType(INTEGER ORDERID,
> > > > > > TIMESTAMP(0) time_column, TIMESTAMP(0) window_start, TIMESTAMP(0)
> > > > > > window_end)])
> > > > > > LogicalProject(ORDERID=[$0], ROWTIME=[$1])
> > > > > > LogicalTableScan(table=[[CATALOG, SALES, SHIPMENTS]])
> > > > > >
> > > > > > But this seems not right, because the "table table_name" argument
> > is
> > > > not
> > > > > a
> > > > > > relational expression, in CALCITE-3955, Julian has pointed out
> that
> > > we
> > > > > > should translate the "table" argument as a CURSOR (or TABLE by
> > > Viliam),
> > > > > but
> > > > > > another question is how to translate the referenced name
> > > "time_column",
> > > > > to
> > > > > > a correlate variable ? which is also confusing because there is
> no
> > > > > > correlation in the plan actually.
> > > > > >
> > > > > > Any suggestions are appreciated !
> > > > > >
> > > > > >
> > > > > > [1] https://issues.apache.org/jira/browse/CALCITE-3955
> > > > > > [2] https://issues.apache.org/jira/browse/CALCITE-3737
> > > > > >
> > > > > > Best,
> > > > > > Danny Chan
> > > > > >
> > > > >
> > > >
> > > >
> > > > --
> > > > Viliam Durina
> > > > Jet Developer
> > > >       hazelcast®
> > > >
> > > >   <https://www.hazelcast.com> 2 W 5th Ave, Ste 300 | San Mateo, CA
> > > 94402 |
> > > > USA
> > > > +1 (650) 521-5453 <(650)%20521-5453> <(650)%20521-5453> |
> > hazelcast.com <
> > > > https://www.hazelcast.com>
> > > >
> > > > --
> > > > This message contains confidential information and is intended only
> for
> > > > the
> > > > individuals named. If you are not the named addressee you should not
> > > > disseminate, distribute or copy this e-mail. Please notify the sender
> > > > immediately by e-mail if you have received this e-mail by mistake and
> > > > delete this e-mail from your system. E-mail transmission cannot be
> > > > guaranteed to be secure or error-free as information could be
> > > intercepted,
> > > > corrupted, lost, destroyed, arrive late or incomplete, or contain
> > > viruses.
> > > > The sender therefore does not accept liability for any errors or
> > > omissions
> > > > in the contents of this message, which arise as a result of e-mail
> > > > transmission. If verification is required, please request a hard-copy
> > > > version. -Hazelcast
> > > >
> > >
> >
>

Re: [DISCUSS] Some quesitons about the new introduced window table functions (TUMBLE, HOP and SESSION)

Posted by Rui Wang <am...@apache.org>.
This seems reasonable to merge window metadata as a column (more compact
format).

Maybe Julian can comment whether there was special consideration to define
window_start and window_end as two separate columns?


-Rui

On Wed, Oct 21, 2020 at 8:20 PM Danny Chan <da...@apache.org> wrote:

> Some of our fellows suggest the window table value functions to append a
> structure type field:
>
> window: ROW(start: timestamp(3), end: timestamp(3))
>
> For example if
>
> table T has schema: (ts: timestamp(3), b: varchar(20)), then
>
> Tumble(TABLE T, DESCRIPTOR(T.ts), INTERVAL '5' MINUTE)
>
> has return type
>
> Row(ts timestamp(3), b varchar(20), window Row(start timestamp(3), end
> timestamp(3)))
>
> Thus in the aggregate key or the join condition, we can compare the
> structure “window” field directly, e.g.
>
> GROUP BY T.window or ON L.window = R.window
>
> What do you think about this ?
>
> Rui Wang <am...@apache.org> 于2020年6月9日周二 上午1:34写道:
>
> > On Mon, Jun 8, 2020 at 12:04 AM Viliam Durina <vi...@hazelcast.com>
> > wrote:
> >
> > > > 2. rename the hop_size to emit_frequency. Because that's what this
> > > argument tries to say: e.g. emit in every x time_unit.
> > >
> > > It's not about how often you emit. By "emit" I mean the moment when the
> > > rows are sent out. For example, when working with event time and the
> > input
> > > stalls, you might emit much later. Or you can use it for historical
> data
> > in
> > > which case the actual interval will be much shorter. It's the time
> > interval
> > > by which the window "hops". Two subsequent windows will be this much
> > apart.
> > >
> > > Regarding the DESCRIPTOR, the PTF is supposed to be implemented by up
> to
> > 4
> > > member functions: describe, start, fulfill, finish. The `describe`
> > function
> > > is supposed to be called during validation. It can validate the
> > arguments.
> > > In case of TUMBLE, it can check whether the column described by the
> > > descriptor is present in the input table and if it is of a correct
> type.
> > It
> > > can throw errors which will become validation errors.
> >
> > Yes. Descriptor has a feature to enable a type checking. If considering
> it,
> > functions will be (argument name not decided yet):
> >
> > TUMBLE(
> > data TABLE,
> > time_column DESCRIPTOR(*TIMESTAMP*),
> > window_size INTERVAL DAY TO SECOND,
> > offset INTERVAL DAY TO SECOND DEFAULT INTERVAL '0' SECONDS)
> >
> > HOP(
> > data TABLE,
> > time_column DESCRIPTOR(*TIMESTAMP*),
> > window_size INTERVAL DAY TO SECOND,
> > emit_frequency INTERVAL DAY TO SECOND,
> > offset INTERVAL DAY TO SECOND DEFAULT INTERVAL '0' SECONDS)
> >
> >
> >
> >
> > > It also determines
> > > the actual output table type.
> > >
> > > I think it's worthwhile to read the PTF spec before implementing this,
> > it's
> > > freely available here:
> > >
> > >
> >
> https://standards.iso.org/ittf/PubliclyAvailableStandards/c069776_ISO_IEC_TR_19075-7_2017.zip
> > >
> > > Viliam
> > >
> > > On Sun, 7 Jun 2020 at 06:23, Rui Wang <am...@apache.org> wrote:
> > >
> > > > Thanks Danny for your summary!
> > > >
> > > > For question one, I prefer the ordering of parameter in option two.
> > > > However, for argument names, I think it might be better to
> > > > 1. name the window size explicitly by "window_size".
> > > > 2. rename the hop_size to emit_frequency. Because that's what this
> > > > argument tries to say: e.g. emit in every x time_unit.
> > > > 3. don't use "table" as the first argument name, because "table" is a
> > > > keyword. So we can replace it by "data"
> > > >
> > > > so it might be
> > > > TUMBLE(data, DESCRIPTOR(time_column), window_size[, offset])
> > > > HOP(data, DESCRIPTOR(time_column), window_size, emit_frequency[,
> > offset])
> > > >
> > > > Meanwhile, one thing worths mentioning is, like Viliam always
> discussed
> > > in
> > > > CALCITE-4000[1], we can also finalize the data types for each
> > signature:
> > > > TUMBLE(
> > > > data TABLE,
> > > > time_column DESCRIPTOR,
> > > > window_size INTERVAL DAY TO SECOND,
> > > > offset INTERVAL DAY TO SECOND DEFAULT INTERVAL '0' SECONDS)
> > > >
> > > > HOP(
> > > > data TABLE,
> > > > time_column DESCRIPTOR,
> > > > window_size INTERVAL DAY TO SECOND,
> > > > emit_frequency INTERVAL DAY TO SECOND,
> > > > offset INTERVAL DAY TO SECOND DEFAULT INTERVAL '0' SECONDS)
> > > >
> > > > So those intervals have to be INTERVAL DAY TO SECOND, not INTERVAL
> YEAR
> > > TO
> > > > MONTH, because year to month varies in seconds.
> > > >
> > > >
> > > > Regarding question two, I tried to dig into codebase and I think it
> > could
> > > > be a RexCorrelVariable to replace that table_name, so it becomes
> > > > table_function(RexCorrelVariable, DESCRIPTOR($1), internal:INTERVAL
> > > MINUTE)
> > > >
> > > > I know RexCorrelVariable might be not designed for this purpose, but
> I
> > > > find RexCorrelVariable is really useful because:
> > > > 1. It can provide an id, to indicate which input of TableFunctionScan
> > > that
> > > > this table_function is applied on. Note that TableFunctionScan might
> > not
> > > > have one input.
> > > > 2. It can provide RelDataType, to save the referenced input's schema.
> > The
> > > > input schema is useful during operator argument type validation (for
> > > > DESCRIPTOR) and operator return type inference.
> > > >
> > > >
> > > > [1]:
> > > >
> > > >
> > >
> >
> https://jira.apache.org/jira/browse/CALCITE-4000?focusedCommentId=17107665&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17107665
> > > >
> > > >
> > > > -Rui
> > > >
> > > > On Thu, Jun 4, 2020 at 6:02 AM Danny Chan <yu...@gmail.com>
> > wrote:
> > > >
> > > > > Hi, all, in the last 2 releases, Rui Wang has contributed the
> window
> > > > table
> > > > > functions to replace the legacy group window syntax, the idea comes
> > > from
> > > > > the SQL-2016 polymorphic table functions. But the current
> > > implementation
> > > > > also got some feedbacks(or confusion), this thread tries to have a
> > > > > discussion on theses questions that are undecided.
> > > > >
> > > > > One is about the window function argument names:
> > > > >
> > > > > === option1
> > > > > TUMBLE(table, DESCRIPTOR(datetime), interval, offset)
> > > > > HOP(table, DESCRIPTOR(time_column), slide, size)
> > > > >
> > > > > === option2
> > > > > TUMBLE(table, DESCRIPTOR(time_column), size, offset)
> > > > > HOP(table, DESCRIPTOR(time_column), size, hop_size, offset)
> > > > >
> > > > > I would prefer for option2 for the reasons pointed out by Viliam in
> > > > > CALCITE-3737.
> > > > >
> > > > >
> > > > > Another is about how to translate the query
> > > > >
> > > > > For query, select * from TUMBLE(table, DESCRIPTOR(time_column),
> > > > interval);
> > > > >
> > > > > Currently our outputs plan is:
> > > > >
> > > > > LogicalProject(ORDERID=[$0], ROWTIME=[$1], window_start=[$2],
> > > > > window_end=[$3])
> > > > > LogicalTableFunctionScan(invocation=[TUMBLE(DESCRIPTOR($1),
> > > > > interal:INTERVAL MINUTE)], rowType=[RecordType(INTEGER ORDERID,
> > > > > TIMESTAMP(0) time_column, TIMESTAMP(0) window_start, TIMESTAMP(0)
> > > > > window_end)])
> > > > > LogicalProject(ORDERID=[$0], ROWTIME=[$1])
> > > > > LogicalTableScan(table=[[CATALOG, SALES, SHIPMENTS]])
> > > > >
> > > > > But this seems not right, because the "table table_name" argument
> is
> > > not
> > > > a
> > > > > relational expression, in CALCITE-3955, Julian has pointed out that
> > we
> > > > > should translate the "table" argument as a CURSOR (or TABLE by
> > Viliam),
> > > > but
> > > > > another question is how to translate the referenced name
> > "time_column",
> > > > to
> > > > > a correlate variable ? which is also confusing because there is no
> > > > > correlation in the plan actually.
> > > > >
> > > > > Any suggestions are appreciated !
> > > > >
> > > > >
> > > > > [1] https://issues.apache.org/jira/browse/CALCITE-3955
> > > > > [2] https://issues.apache.org/jira/browse/CALCITE-3737
> > > > >
> > > > > Best,
> > > > > Danny Chan
> > > > >
> > > >
> > >
> > >
> > > --
> > > Viliam Durina
> > > Jet Developer
> > >       hazelcast®
> > >
> > >   <https://www.hazelcast.com> 2 W 5th Ave, Ste 300 | San Mateo, CA
> > 94402 |
> > > USA
> > > +1 (650) 521-5453 <(650)%20521-5453> <(650)%20521-5453> |
> hazelcast.com <
> > > https://www.hazelcast.com>
> > >
> > > --
> > > This message contains confidential information and is intended only for
> > > the
> > > individuals named. If you are not the named addressee you should not
> > > disseminate, distribute or copy this e-mail. Please notify the sender
> > > immediately by e-mail if you have received this e-mail by mistake and
> > > delete this e-mail from your system. E-mail transmission cannot be
> > > guaranteed to be secure or error-free as information could be
> > intercepted,
> > > corrupted, lost, destroyed, arrive late or incomplete, or contain
> > viruses.
> > > The sender therefore does not accept liability for any errors or
> > omissions
> > > in the contents of this message, which arise as a result of e-mail
> > > transmission. If verification is required, please request a hard-copy
> > > version. -Hazelcast
> > >
> >
>

Re: [DISCUSS] Some quesitons about the new introduced window table functions (TUMBLE, HOP and SESSION)

Posted by Danny Chan <da...@apache.org>.
Some of our fellows suggest the window table value functions to append a
structure type field:

window: ROW(start: timestamp(3), end: timestamp(3))

For example if

table T has schema: (ts: timestamp(3), b: varchar(20)), then

Tumble(TABLE T, DESCRIPTOR(T.ts), INTERVAL '5' MINUTE)

has return type

Row(ts timestamp(3), b varchar(20), window Row(start timestamp(3), end
timestamp(3)))

Thus in the aggregate key or the join condition, we can compare the
structure “window” field directly, e.g.

GROUP BY T.window or ON L.window = R.window

What do you think about this ?

Rui Wang <am...@apache.org> 于2020年6月9日周二 上午1:34写道:

> On Mon, Jun 8, 2020 at 12:04 AM Viliam Durina <vi...@hazelcast.com>
> wrote:
>
> > > 2. rename the hop_size to emit_frequency. Because that's what this
> > argument tries to say: e.g. emit in every x time_unit.
> >
> > It's not about how often you emit. By "emit" I mean the moment when the
> > rows are sent out. For example, when working with event time and the
> input
> > stalls, you might emit much later. Or you can use it for historical data
> in
> > which case the actual interval will be much shorter. It's the time
> interval
> > by which the window "hops". Two subsequent windows will be this much
> apart.
> >
> > Regarding the DESCRIPTOR, the PTF is supposed to be implemented by up to
> 4
> > member functions: describe, start, fulfill, finish. The `describe`
> function
> > is supposed to be called during validation. It can validate the
> arguments.
> > In case of TUMBLE, it can check whether the column described by the
> > descriptor is present in the input table and if it is of a correct type.
> It
> > can throw errors which will become validation errors.
>
> Yes. Descriptor has a feature to enable a type checking. If considering it,
> functions will be (argument name not decided yet):
>
> TUMBLE(
> data TABLE,
> time_column DESCRIPTOR(*TIMESTAMP*),
> window_size INTERVAL DAY TO SECOND,
> offset INTERVAL DAY TO SECOND DEFAULT INTERVAL '0' SECONDS)
>
> HOP(
> data TABLE,
> time_column DESCRIPTOR(*TIMESTAMP*),
> window_size INTERVAL DAY TO SECOND,
> emit_frequency INTERVAL DAY TO SECOND,
> offset INTERVAL DAY TO SECOND DEFAULT INTERVAL '0' SECONDS)
>
>
>
>
> > It also determines
> > the actual output table type.
> >
> > I think it's worthwhile to read the PTF spec before implementing this,
> it's
> > freely available here:
> >
> >
> https://standards.iso.org/ittf/PubliclyAvailableStandards/c069776_ISO_IEC_TR_19075-7_2017.zip
> >
> > Viliam
> >
> > On Sun, 7 Jun 2020 at 06:23, Rui Wang <am...@apache.org> wrote:
> >
> > > Thanks Danny for your summary!
> > >
> > > For question one, I prefer the ordering of parameter in option two.
> > > However, for argument names, I think it might be better to
> > > 1. name the window size explicitly by "window_size".
> > > 2. rename the hop_size to emit_frequency. Because that's what this
> > > argument tries to say: e.g. emit in every x time_unit.
> > > 3. don't use "table" as the first argument name, because "table" is a
> > > keyword. So we can replace it by "data"
> > >
> > > so it might be
> > > TUMBLE(data, DESCRIPTOR(time_column), window_size[, offset])
> > > HOP(data, DESCRIPTOR(time_column), window_size, emit_frequency[,
> offset])
> > >
> > > Meanwhile, one thing worths mentioning is, like Viliam always discussed
> > in
> > > CALCITE-4000[1], we can also finalize the data types for each
> signature:
> > > TUMBLE(
> > > data TABLE,
> > > time_column DESCRIPTOR,
> > > window_size INTERVAL DAY TO SECOND,
> > > offset INTERVAL DAY TO SECOND DEFAULT INTERVAL '0' SECONDS)
> > >
> > > HOP(
> > > data TABLE,
> > > time_column DESCRIPTOR,
> > > window_size INTERVAL DAY TO SECOND,
> > > emit_frequency INTERVAL DAY TO SECOND,
> > > offset INTERVAL DAY TO SECOND DEFAULT INTERVAL '0' SECONDS)
> > >
> > > So those intervals have to be INTERVAL DAY TO SECOND, not INTERVAL YEAR
> > TO
> > > MONTH, because year to month varies in seconds.
> > >
> > >
> > > Regarding question two, I tried to dig into codebase and I think it
> could
> > > be a RexCorrelVariable to replace that table_name, so it becomes
> > > table_function(RexCorrelVariable, DESCRIPTOR($1), internal:INTERVAL
> > MINUTE)
> > >
> > > I know RexCorrelVariable might be not designed for this purpose, but I
> > > find RexCorrelVariable is really useful because:
> > > 1. It can provide an id, to indicate which input of TableFunctionScan
> > that
> > > this table_function is applied on. Note that TableFunctionScan might
> not
> > > have one input.
> > > 2. It can provide RelDataType, to save the referenced input's schema.
> The
> > > input schema is useful during operator argument type validation (for
> > > DESCRIPTOR) and operator return type inference.
> > >
> > >
> > > [1]:
> > >
> > >
> >
> https://jira.apache.org/jira/browse/CALCITE-4000?focusedCommentId=17107665&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17107665
> > >
> > >
> > > -Rui
> > >
> > > On Thu, Jun 4, 2020 at 6:02 AM Danny Chan <yu...@gmail.com>
> wrote:
> > >
> > > > Hi, all, in the last 2 releases, Rui Wang has contributed the window
> > > table
> > > > functions to replace the legacy group window syntax, the idea comes
> > from
> > > > the SQL-2016 polymorphic table functions. But the current
> > implementation
> > > > also got some feedbacks(or confusion), this thread tries to have a
> > > > discussion on theses questions that are undecided.
> > > >
> > > > One is about the window function argument names:
> > > >
> > > > === option1
> > > > TUMBLE(table, DESCRIPTOR(datetime), interval, offset)
> > > > HOP(table, DESCRIPTOR(time_column), slide, size)
> > > >
> > > > === option2
> > > > TUMBLE(table, DESCRIPTOR(time_column), size, offset)
> > > > HOP(table, DESCRIPTOR(time_column), size, hop_size, offset)
> > > >
> > > > I would prefer for option2 for the reasons pointed out by Viliam in
> > > > CALCITE-3737.
> > > >
> > > >
> > > > Another is about how to translate the query
> > > >
> > > > For query, select * from TUMBLE(table, DESCRIPTOR(time_column),
> > > interval);
> > > >
> > > > Currently our outputs plan is:
> > > >
> > > > LogicalProject(ORDERID=[$0], ROWTIME=[$1], window_start=[$2],
> > > > window_end=[$3])
> > > > LogicalTableFunctionScan(invocation=[TUMBLE(DESCRIPTOR($1),
> > > > interal:INTERVAL MINUTE)], rowType=[RecordType(INTEGER ORDERID,
> > > > TIMESTAMP(0) time_column, TIMESTAMP(0) window_start, TIMESTAMP(0)
> > > > window_end)])
> > > > LogicalProject(ORDERID=[$0], ROWTIME=[$1])
> > > > LogicalTableScan(table=[[CATALOG, SALES, SHIPMENTS]])
> > > >
> > > > But this seems not right, because the "table table_name" argument is
> > not
> > > a
> > > > relational expression, in CALCITE-3955, Julian has pointed out that
> we
> > > > should translate the "table" argument as a CURSOR (or TABLE by
> Viliam),
> > > but
> > > > another question is how to translate the referenced name
> "time_column",
> > > to
> > > > a correlate variable ? which is also confusing because there is no
> > > > correlation in the plan actually.
> > > >
> > > > Any suggestions are appreciated !
> > > >
> > > >
> > > > [1] https://issues.apache.org/jira/browse/CALCITE-3955
> > > > [2] https://issues.apache.org/jira/browse/CALCITE-3737
> > > >
> > > > Best,
> > > > Danny Chan
> > > >
> > >
> >
> >
> > --
> > Viliam Durina
> > Jet Developer
> >       hazelcast®
> >
> >   <https://www.hazelcast.com> 2 W 5th Ave, Ste 300 | San Mateo, CA
> 94402 |
> > USA
> > +1 (650) 521-5453 <(650)%20521-5453> | hazelcast.com <
> > https://www.hazelcast.com>
> >
> > --
> > This message contains confidential information and is intended only for
> > the
> > individuals named. If you are not the named addressee you should not
> > disseminate, distribute or copy this e-mail. Please notify the sender
> > immediately by e-mail if you have received this e-mail by mistake and
> > delete this e-mail from your system. E-mail transmission cannot be
> > guaranteed to be secure or error-free as information could be
> intercepted,
> > corrupted, lost, destroyed, arrive late or incomplete, or contain
> viruses.
> > The sender therefore does not accept liability for any errors or
> omissions
> > in the contents of this message, which arise as a result of e-mail
> > transmission. If verification is required, please request a hard-copy
> > version. -Hazelcast
> >
>

Re: [DISCUSS] Some quesitons about the new introduced window table functions (TUMBLE, HOP and SESSION)

Posted by Rui Wang <am...@apache.org>.
On Mon, Jun 8, 2020 at 12:04 AM Viliam Durina <vi...@hazelcast.com> wrote:

> > 2. rename the hop_size to emit_frequency. Because that's what this
> argument tries to say: e.g. emit in every x time_unit.
>
> It's not about how often you emit. By "emit" I mean the moment when the
> rows are sent out. For example, when working with event time and the input
> stalls, you might emit much later. Or you can use it for historical data in
> which case the actual interval will be much shorter. It's the time interval
> by which the window "hops". Two subsequent windows will be this much apart.
>
> Regarding the DESCRIPTOR, the PTF is supposed to be implemented by up to 4
> member functions: describe, start, fulfill, finish. The `describe` function
> is supposed to be called during validation. It can validate the arguments.
> In case of TUMBLE, it can check whether the column described by the
> descriptor is present in the input table and if it is of a correct type. It
> can throw errors which will become validation errors.

Yes. Descriptor has a feature to enable a type checking. If considering it,
functions will be (argument name not decided yet):

TUMBLE(
data TABLE,
time_column DESCRIPTOR(*TIMESTAMP*),
window_size INTERVAL DAY TO SECOND,
offset INTERVAL DAY TO SECOND DEFAULT INTERVAL '0' SECONDS)

HOP(
data TABLE,
time_column DESCRIPTOR(*TIMESTAMP*),
window_size INTERVAL DAY TO SECOND,
emit_frequency INTERVAL DAY TO SECOND,
offset INTERVAL DAY TO SECOND DEFAULT INTERVAL '0' SECONDS)




> It also determines
> the actual output table type.
>
> I think it's worthwhile to read the PTF spec before implementing this, it's
> freely available here:
>
> https://standards.iso.org/ittf/PubliclyAvailableStandards/c069776_ISO_IEC_TR_19075-7_2017.zip
>
> Viliam
>
> On Sun, 7 Jun 2020 at 06:23, Rui Wang <am...@apache.org> wrote:
>
> > Thanks Danny for your summary!
> >
> > For question one, I prefer the ordering of parameter in option two.
> > However, for argument names, I think it might be better to
> > 1. name the window size explicitly by "window_size".
> > 2. rename the hop_size to emit_frequency. Because that's what this
> > argument tries to say: e.g. emit in every x time_unit.
> > 3. don't use "table" as the first argument name, because "table" is a
> > keyword. So we can replace it by "data"
> >
> > so it might be
> > TUMBLE(data, DESCRIPTOR(time_column), window_size[, offset])
> > HOP(data, DESCRIPTOR(time_column), window_size, emit_frequency[, offset])
> >
> > Meanwhile, one thing worths mentioning is, like Viliam always discussed
> in
> > CALCITE-4000[1], we can also finalize the data types for each signature:
> > TUMBLE(
> > data TABLE,
> > time_column DESCRIPTOR,
> > window_size INTERVAL DAY TO SECOND,
> > offset INTERVAL DAY TO SECOND DEFAULT INTERVAL '0' SECONDS)
> >
> > HOP(
> > data TABLE,
> > time_column DESCRIPTOR,
> > window_size INTERVAL DAY TO SECOND,
> > emit_frequency INTERVAL DAY TO SECOND,
> > offset INTERVAL DAY TO SECOND DEFAULT INTERVAL '0' SECONDS)
> >
> > So those intervals have to be INTERVAL DAY TO SECOND, not INTERVAL YEAR
> TO
> > MONTH, because year to month varies in seconds.
> >
> >
> > Regarding question two, I tried to dig into codebase and I think it could
> > be a RexCorrelVariable to replace that table_name, so it becomes
> > table_function(RexCorrelVariable, DESCRIPTOR($1), internal:INTERVAL
> MINUTE)
> >
> > I know RexCorrelVariable might be not designed for this purpose, but I
> > find RexCorrelVariable is really useful because:
> > 1. It can provide an id, to indicate which input of TableFunctionScan
> that
> > this table_function is applied on. Note that TableFunctionScan might not
> > have one input.
> > 2. It can provide RelDataType, to save the referenced input's schema. The
> > input schema is useful during operator argument type validation (for
> > DESCRIPTOR) and operator return type inference.
> >
> >
> > [1]:
> >
> >
> https://jira.apache.org/jira/browse/CALCITE-4000?focusedCommentId=17107665&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17107665
> >
> >
> > -Rui
> >
> > On Thu, Jun 4, 2020 at 6:02 AM Danny Chan <yu...@gmail.com> wrote:
> >
> > > Hi, all, in the last 2 releases, Rui Wang has contributed the window
> > table
> > > functions to replace the legacy group window syntax, the idea comes
> from
> > > the SQL-2016 polymorphic table functions. But the current
> implementation
> > > also got some feedbacks(or confusion), this thread tries to have a
> > > discussion on theses questions that are undecided.
> > >
> > > One is about the window function argument names:
> > >
> > > === option1
> > > TUMBLE(table, DESCRIPTOR(datetime), interval, offset)
> > > HOP(table, DESCRIPTOR(time_column), slide, size)
> > >
> > > === option2
> > > TUMBLE(table, DESCRIPTOR(time_column), size, offset)
> > > HOP(table, DESCRIPTOR(time_column), size, hop_size, offset)
> > >
> > > I would prefer for option2 for the reasons pointed out by Viliam in
> > > CALCITE-3737.
> > >
> > >
> > > Another is about how to translate the query
> > >
> > > For query, select * from TUMBLE(table, DESCRIPTOR(time_column),
> > interval);
> > >
> > > Currently our outputs plan is:
> > >
> > > LogicalProject(ORDERID=[$0], ROWTIME=[$1], window_start=[$2],
> > > window_end=[$3])
> > > LogicalTableFunctionScan(invocation=[TUMBLE(DESCRIPTOR($1),
> > > interal:INTERVAL MINUTE)], rowType=[RecordType(INTEGER ORDERID,
> > > TIMESTAMP(0) time_column, TIMESTAMP(0) window_start, TIMESTAMP(0)
> > > window_end)])
> > > LogicalProject(ORDERID=[$0], ROWTIME=[$1])
> > > LogicalTableScan(table=[[CATALOG, SALES, SHIPMENTS]])
> > >
> > > But this seems not right, because the "table table_name" argument is
> not
> > a
> > > relational expression, in CALCITE-3955, Julian has pointed out that we
> > > should translate the "table" argument as a CURSOR (or TABLE by Viliam),
> > but
> > > another question is how to translate the referenced name "time_column",
> > to
> > > a correlate variable ? which is also confusing because there is no
> > > correlation in the plan actually.
> > >
> > > Any suggestions are appreciated !
> > >
> > >
> > > [1] https://issues.apache.org/jira/browse/CALCITE-3955
> > > [2] https://issues.apache.org/jira/browse/CALCITE-3737
> > >
> > > Best,
> > > Danny Chan
> > >
> >
>
>
> --
> Viliam Durina
> Jet Developer
>       hazelcast®
>
>   <https://www.hazelcast.com> 2 W 5th Ave, Ste 300 | San Mateo, CA 94402 |
> USA
> +1 (650) 521-5453 <(650)%20521-5453> | hazelcast.com <
> https://www.hazelcast.com>
>
> --
> This message contains confidential information and is intended only for
> the
> individuals named. If you are not the named addressee you should not
> disseminate, distribute or copy this e-mail. Please notify the sender
> immediately by e-mail if you have received this e-mail by mistake and
> delete this e-mail from your system. E-mail transmission cannot be
> guaranteed to be secure or error-free as information could be intercepted,
> corrupted, lost, destroyed, arrive late or incomplete, or contain viruses.
> The sender therefore does not accept liability for any errors or omissions
> in the contents of this message, which arise as a result of e-mail
> transmission. If verification is required, please request a hard-copy
> version. -Hazelcast
>

Re: [DISCUSS] Some quesitons about the new introduced window table functions (TUMBLE, HOP and SESSION)

Posted by Viliam Durina <vi...@hazelcast.com>.
> 2. rename the hop_size to emit_frequency. Because that's what this
argument tries to say: e.g. emit in every x time_unit.

It's not about how often you emit. By "emit" I mean the moment when the
rows are sent out. For example, when working with event time and the input
stalls, you might emit much later. Or you can use it for historical data in
which case the actual interval will be much shorter. It's the time interval
by which the window "hops". Two subsequent windows will be this much apart.

Regarding the DESCRIPTOR, the PTF is supposed to be implemented by up to 4
member functions: describe, start, fulfill, finish. The `describe` function
is supposed to be called during validation. It can validate the arguments.
In case of TUMBLE, it can check whether the column described by the
descriptor is present in the input table and if it is of a correct type. It
can throw errors which will become validation errors. It also determines
the actual output table type.

I think it's worthwhile to read the PTF spec before implementing this, it's
freely available here:
https://standards.iso.org/ittf/PubliclyAvailableStandards/c069776_ISO_IEC_TR_19075-7_2017.zip

Viliam

On Sun, 7 Jun 2020 at 06:23, Rui Wang <am...@apache.org> wrote:

> Thanks Danny for your summary!
>
> For question one, I prefer the ordering of parameter in option two.
> However, for argument names, I think it might be better to
> 1. name the window size explicitly by "window_size".
> 2. rename the hop_size to emit_frequency. Because that's what this
> argument tries to say: e.g. emit in every x time_unit.
> 3. don't use "table" as the first argument name, because "table" is a
> keyword. So we can replace it by "data"
>
> so it might be
> TUMBLE(data, DESCRIPTOR(time_column), window_size[, offset])
> HOP(data, DESCRIPTOR(time_column), window_size, emit_frequency[, offset])
>
> Meanwhile, one thing worths mentioning is, like Viliam always discussed in
> CALCITE-4000[1], we can also finalize the data types for each signature:
> TUMBLE(
> data TABLE,
> time_column DESCRIPTOR,
> window_size INTERVAL DAY TO SECOND,
> offset INTERVAL DAY TO SECOND DEFAULT INTERVAL '0' SECONDS)
>
> HOP(
> data TABLE,
> time_column DESCRIPTOR,
> window_size INTERVAL DAY TO SECOND,
> emit_frequency INTERVAL DAY TO SECOND,
> offset INTERVAL DAY TO SECOND DEFAULT INTERVAL '0' SECONDS)
>
> So those intervals have to be INTERVAL DAY TO SECOND, not INTERVAL YEAR TO
> MONTH, because year to month varies in seconds.
>
>
> Regarding question two, I tried to dig into codebase and I think it could
> be a RexCorrelVariable to replace that table_name, so it becomes
> table_function(RexCorrelVariable, DESCRIPTOR($1), internal:INTERVAL MINUTE)
>
> I know RexCorrelVariable might be not designed for this purpose, but I
> find RexCorrelVariable is really useful because:
> 1. It can provide an id, to indicate which input of TableFunctionScan that
> this table_function is applied on. Note that TableFunctionScan might not
> have one input.
> 2. It can provide RelDataType, to save the referenced input's schema. The
> input schema is useful during operator argument type validation (for
> DESCRIPTOR) and operator return type inference.
>
>
> [1]:
>
> https://jira.apache.org/jira/browse/CALCITE-4000?focusedCommentId=17107665&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17107665
>
>
> -Rui
>
> On Thu, Jun 4, 2020 at 6:02 AM Danny Chan <yu...@gmail.com> wrote:
>
> > Hi, all, in the last 2 releases, Rui Wang has contributed the window
> table
> > functions to replace the legacy group window syntax, the idea comes from
> > the SQL-2016 polymorphic table functions. But the current implementation
> > also got some feedbacks(or confusion), this thread tries to have a
> > discussion on theses questions that are undecided.
> >
> > One is about the window function argument names:
> >
> > === option1
> > TUMBLE(table, DESCRIPTOR(datetime), interval, offset)
> > HOP(table, DESCRIPTOR(time_column), slide, size)
> >
> > === option2
> > TUMBLE(table, DESCRIPTOR(time_column), size, offset)
> > HOP(table, DESCRIPTOR(time_column), size, hop_size, offset)
> >
> > I would prefer for option2 for the reasons pointed out by Viliam in
> > CALCITE-3737.
> >
> >
> > Another is about how to translate the query
> >
> > For query, select * from TUMBLE(table, DESCRIPTOR(time_column),
> interval);
> >
> > Currently our outputs plan is:
> >
> > LogicalProject(ORDERID=[$0], ROWTIME=[$1], window_start=[$2],
> > window_end=[$3])
> > LogicalTableFunctionScan(invocation=[TUMBLE(DESCRIPTOR($1),
> > interal:INTERVAL MINUTE)], rowType=[RecordType(INTEGER ORDERID,
> > TIMESTAMP(0) time_column, TIMESTAMP(0) window_start, TIMESTAMP(0)
> > window_end)])
> > LogicalProject(ORDERID=[$0], ROWTIME=[$1])
> > LogicalTableScan(table=[[CATALOG, SALES, SHIPMENTS]])
> >
> > But this seems not right, because the "table table_name" argument is not
> a
> > relational expression, in CALCITE-3955, Julian has pointed out that we
> > should translate the "table" argument as a CURSOR (or TABLE by Viliam),
> but
> > another question is how to translate the referenced name "time_column",
> to
> > a correlate variable ? which is also confusing because there is no
> > correlation in the plan actually.
> >
> > Any suggestions are appreciated !
> >
> >
> > [1] https://issues.apache.org/jira/browse/CALCITE-3955
> > [2] https://issues.apache.org/jira/browse/CALCITE-3737
> >
> > Best,
> > Danny Chan
> >
>


-- 
Viliam Durina
Jet Developer
      hazelcast®

  <https://www.hazelcast.com> 2 W 5th Ave, Ste 300 | San Mateo, CA 94402 |
USA
+1 (650) 521-5453 | hazelcast.com <https://www.hazelcast.com>

-- 
This message contains confidential information and is intended only for the 
individuals named. If you are not the named addressee you should not 
disseminate, distribute or copy this e-mail. Please notify the sender 
immediately by e-mail if you have received this e-mail by mistake and 
delete this e-mail from your system. E-mail transmission cannot be 
guaranteed to be secure or error-free as information could be intercepted, 
corrupted, lost, destroyed, arrive late or incomplete, or contain viruses. 
The sender therefore does not accept liability for any errors or omissions 
in the contents of this message, which arise as a result of e-mail 
transmission. If verification is required, please request a hard-copy 
version. -Hazelcast

Re: [DISCUSS] Some quesitons about the new introduced window table functions (TUMBLE, HOP and SESSION)

Posted by Rui Wang <am...@apache.org>.
Thanks Danny for your summary!

For question one, I prefer the ordering of parameter in option two.
However, for argument names, I think it might be better to
1. name the window size explicitly by "window_size".
2. rename the hop_size to emit_frequency. Because that's what this
argument tries to say: e.g. emit in every x time_unit.
3. don't use "table" as the first argument name, because "table" is a
keyword. So we can replace it by "data"

so it might be
TUMBLE(data, DESCRIPTOR(time_column), window_size[, offset])
HOP(data, DESCRIPTOR(time_column), window_size, emit_frequency[, offset])

Meanwhile, one thing worths mentioning is, like Viliam always discussed in
CALCITE-4000[1], we can also finalize the data types for each signature:
TUMBLE(
data TABLE,
time_column DESCRIPTOR,
window_size INTERVAL DAY TO SECOND,
offset INTERVAL DAY TO SECOND DEFAULT INTERVAL '0' SECONDS)

HOP(
data TABLE,
time_column DESCRIPTOR,
window_size INTERVAL DAY TO SECOND,
emit_frequency INTERVAL DAY TO SECOND,
offset INTERVAL DAY TO SECOND DEFAULT INTERVAL '0' SECONDS)

So those intervals have to be INTERVAL DAY TO SECOND, not INTERVAL YEAR TO
MONTH, because year to month varies in seconds.


Regarding question two, I tried to dig into codebase and I think it could
be a RexCorrelVariable to replace that table_name, so it becomes
table_function(RexCorrelVariable, DESCRIPTOR($1), internal:INTERVAL MINUTE)

I know RexCorrelVariable might be not designed for this purpose, but I
find RexCorrelVariable is really useful because:
1. It can provide an id, to indicate which input of TableFunctionScan that
this table_function is applied on. Note that TableFunctionScan might not
have one input.
2. It can provide RelDataType, to save the referenced input's schema. The
input schema is useful during operator argument type validation (for
DESCRIPTOR) and operator return type inference.


[1]:
https://jira.apache.org/jira/browse/CALCITE-4000?focusedCommentId=17107665&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17107665


-Rui

On Thu, Jun 4, 2020 at 6:02 AM Danny Chan <yu...@gmail.com> wrote:

> Hi, all, in the last 2 releases, Rui Wang has contributed the window table
> functions to replace the legacy group window syntax, the idea comes from
> the SQL-2016 polymorphic table functions. But the current implementation
> also got some feedbacks(or confusion), this thread tries to have a
> discussion on theses questions that are undecided.
>
> One is about the window function argument names:
>
> === option1
> TUMBLE(table, DESCRIPTOR(datetime), interval, offset)
> HOP(table, DESCRIPTOR(time_column), slide, size)
>
> === option2
> TUMBLE(table, DESCRIPTOR(time_column), size, offset)
> HOP(table, DESCRIPTOR(time_column), size, hop_size, offset)
>
> I would prefer for option2 for the reasons pointed out by Viliam in
> CALCITE-3737.
>
>
> Another is about how to translate the query
>
> For query, select * from TUMBLE(table, DESCRIPTOR(time_column), interval);
>
> Currently our outputs plan is:
>
> LogicalProject(ORDERID=[$0], ROWTIME=[$1], window_start=[$2],
> window_end=[$3])
> LogicalTableFunctionScan(invocation=[TUMBLE(DESCRIPTOR($1),
> interal:INTERVAL MINUTE)], rowType=[RecordType(INTEGER ORDERID,
> TIMESTAMP(0) time_column, TIMESTAMP(0) window_start, TIMESTAMP(0)
> window_end)])
> LogicalProject(ORDERID=[$0], ROWTIME=[$1])
> LogicalTableScan(table=[[CATALOG, SALES, SHIPMENTS]])
>
> But this seems not right, because the "table table_name" argument is not a
> relational expression, in CALCITE-3955, Julian has pointed out that we
> should translate the "table" argument as a CURSOR (or TABLE by Viliam), but
> another question is how to translate the referenced name "time_column", to
> a correlate variable ? which is also confusing because there is no
> correlation in the plan actually.
>
> Any suggestions are appreciated !
>
>
> [1] https://issues.apache.org/jira/browse/CALCITE-3955
> [2] https://issues.apache.org/jira/browse/CALCITE-3737
>
> Best,
> Danny Chan
>