You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@calcite.apache.org by "Julian Hyde (JIRA)" <ji...@apache.org> on 2015/06/25 02:38:04 UTC

[jira] [Commented] (CALCITE-768) Hopping windows

    [ https://issues.apache.org/jira/browse/CALCITE-768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14600437#comment-14600437 ] 

Julian Hyde commented on CALCITE-768:
-------------------------------------

See related email thread http://mail-archives.apache.org/mod_mbox/incubator-calcite-dev/201506.mbox/%3CCAPSgeETxusddoevOpUBW4mfrRhRSHVOX_jOKzc-Jdcvpo3tq4A%40mail.gmail.com%3E.

> Hopping windows
> ---------------
>
>                 Key: CALCITE-768
>                 URL: https://issues.apache.org/jira/browse/CALCITE-768
>             Project: Calcite
>          Issue Type: Bug
>            Reporter: Julian Hyde
>            Assignee: Julian Hyde
>
> Recall that a hopping window emits a sub-total every X seconds of
> records that have arrived over the last Y seconds. A tumbling window
> is a hopping window where X and Y are equal.
> In https://calcite.incubator.apache.org/docs/stream.html#hopping-windows
> I give an example, "emit, every hour, the number of each product
> ordered over the past three hours".
> That example gives a query in terms of a GROUP BY (in the HourlyTotals
> view) followed by a moving sum. I didn't think that it was possible to
> express using just one GROUP BY, because that would violate one of the
> principles of SQL: that each record entering a GROUP BY contributes to
> precisely one output record.
> But I've just realized that the CUBE, ROLLUP and GROUPING SETS
> operators (already in SQL) violate that principle. And if they can do
> it, we can do the same. So we could add another grouping function,
> HOP(t, emit, retain).
> The query would look like this:
> {code}
> SELECT STREAM START(rowtime) AS rowtime,
>   productId,
>   SUM(units) AS sumUnits,
>   COUNT(*) AS c
> FROM Orders
> GROUP BY HOP(rowtime, INTERVAL '1' HOUR, INTERVAL '3' HOUR),
>   productId
> {code}
> Much nicer than the one in stream.html!
> The "trick" is that the HOP function is returning a list of rowtime
> values. For example, for row 1 {rowtime: '09:33', ...} it will return
> ['09:00', '10:00', '11:00']; for row 2 {rowtime: '10:05', ...} it will
> return ['10:00', '11:00', '12:00']. The system adds each row to
> several sub-totals, and emits each sub-total when it is complete. The
> sub-total for '09:00' will contain only row 1, and will be emitted at
> '10:00'; the sub-total for '10:00' will contain row 1 and row 2, and
> will be emitted at '11:00', and so forth.
> Returning multiple values is related to the flatMap function in Spark
> (and earlier selectMany in LINQ) and makes HOP's semantics similar to
> GROUPING SETS and therefore sound.
> START is a new aggregate function that returns the lower bound of the
> current sub-total; END similarly.
> Note that the "retain" argument does not need to be a whole multiple
> of the "emit" argument. This was a major limitation in the previous
> proposal.
> There are some straightforward extensions:
> * Define a TUMBLE function;
> * Add an "align" argument to HOP, to allow windows to start at, say, 5
> minutes past each hour;
> * Apply HOP to windows based on row-counts;
> * Allow user-defined windowing functions that similarly return a list
> of interval start-end points.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)