You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@calcite.apache.org by "Julian Hyde (JIRA)" <ji...@apache.org> on 2015/06/25 00:38:04 UTC

[jira] [Created] (CALCITE-768) Hopping windows

Julian Hyde created CALCITE-768:
-----------------------------------

             Summary: Hopping windows
                 Key: CALCITE-768
                 URL: https://issues.apache.org/jira/browse/CALCITE-768
             Project: Calcite
          Issue Type: Bug
            Reporter: Julian Hyde
            Assignee: Julian Hyde


Recall that a hopping window emits a sub-total every X seconds of
records that have arrived over the last Y seconds. A tumbling window
is a hopping window where X and Y are equal.

In https://calcite.incubator.apache.org/docs/stream.html#hopping-windows
I give an example, "emit, every hour, the number of each product
ordered over the past three hours".

That example gives a query in terms of a GROUP BY (in the HourlyTotals
view) followed by a moving sum. I didn't think that it was possible to
express using just one GROUP BY, because that would violate one of the
principles of SQL: that each record entering a GROUP BY contributes to
precisely one output record.

But I've just realized that the CUBE, ROLLUP and GROUPING SETS
operators (already in SQL) violate that principle. And if they can do
it, we can do the same. So we could add another grouping function,
HOP(t, emit, retain).

The query would look like this:

{code}
SELECT STREAM START(rowtime) AS rowtime,
  productId,
  SUM(units) AS sumUnits,
  COUNT(*) AS c
FROM Orders
GROUP BY HOP(rowtime, INTERVAL '1' HOUR, INTERVAL '3' HOUR),
  productId
{code}

Much nicer than the one in stream.html!

The "trick" is that the HOP function is returning a list of rowtime
values. For example, for row 1 {rowtime: '09:33', ...} it will return
['09:00', '10:00', '11:00']; for row 2 {rowtime: '10:05', ...} it will
return ['10:00', '11:00', '12:00']. The system adds each row to
several sub-totals, and emits each sub-total when it is complete. The
sub-total for '09:00' will contain only row 1, and will be emitted at
'10:00'; the sub-total for '10:00' will contain row 1 and row 2, and
will be emitted at '11:00', and so forth.

Returning multiple values is related to the flatMap function in Spark
(and earlier selectMany in LINQ) and makes HOP's semantics similar to
GROUPING SETS and therefore sound.

START is a new aggregate function that returns the lower bound of the
current sub-total; END similarly.

Note that the "retain" argument does not need to be a whole multiple
of the "emit" argument. This was a major limitation in the previous
proposal.

There are some straightforward extensions:
* Define a TUMBLE function;
* Add an "align" argument to HOP, to allow windows to start at, say, 5
minutes past each hour;
* Apply HOP to windows based on row-counts;
* Allow user-defined windowing functions that similarly return a list
of interval start-end points.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)