Posted to commits@samza.apache.org by "Mohamed Mahmoud (El-Geish) (JIRA)" <ji...@apache.org> on 2015/04/17 03:47:58 UTC

[jira] [Commented] (SAMZA-551) SQL grammar support for window operator

    [ https://issues.apache.org/jira/browse/SAMZA-551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14499116#comment-14499116 ] 

Mohamed Mahmoud (El-Geish) commented on SAMZA-551:
--------------------------------------------------

Even though window operators are not going to be supported in early versions, I suggest reserving the keywords that could be used in this context for future use.

> SQL grammar support for window operator
> ---------------------------------------
>
>                 Key: SAMZA-551
>                 URL: https://issues.apache.org/jira/browse/SAMZA-551
>             Project: Samza
>          Issue Type: Sub-task
>          Components: sql
>    Affects Versions: 0.9.0
>            Reporter: Yi Pan (Data Infrastructure)
>
> Consider that we want a count of stock trades (an infinite stream) that happened in the last hour, but emitted only every 11min. It is easy to write the first part in sqlstream as:
> {code}
>    SELECT STREAM rowtime, count(*) OVER (ORDER BY rowtime RANGE INTERVAL '1' HOUR PRECEDING)
>        FROM Trades
> {code}
> The above creates a stream that emits, for each row as it is scanned, the count of trades in the hour preceding that row.
> Now here is the question:
> # How do we get the count every 11min instead of as each row comes in? As we discussed before, we can truncate / group on the rowtime to "sample" the continuously moving counting window and get a count every 11min. But that has two issues:
> ** From an implementation point of view, there is no efficiency improvement, since the system still computes the count for every row that comes in
> ** If Samza implements a more efficient tumbling window operator, there is no easy way to identify the section of the SQL statement that can map to it, as the sampling is done via math / group-by aggregation instead of a window spec
> # If there is no row in Trades between 12:00pm and 2:00pm, how do we tell the system to still generate 0 counts for the time moments 12:11pm, 12:22pm, 12:33pm, etc.? Or, if rows are delayed in delivery and the user wants to close the counting window after a 5min timeout and ignore messages that arrive later, how can we support that use case w/o breaking SQL grammar?
> Both of the above issues seem to require some extension to the window spec in SQL grammar. Julian, what do you think? Would it create too many language/parser/planner problems in SQL?
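
For illustration only, here is a minimal Python sketch (not Samza or SQLstream code) of the two semantics contrasted in the description above: a per-row sliding count over the preceding hour, and a tumbling 11min count that still emits 0 for empty windows and ignores rows arriving more than 5min after their window ends. The function names, the (event_time, arrival_time) pair representation, and the exact late-arrival rule are assumptions made for this sketch; the issue does not define them.

```python
from collections import deque
from datetime import datetime, timedelta


def sliding_counts(times, span=timedelta(hours=1)):
    """For each row time (assumed sorted ascending), count the rows whose
    time lies within `span` before it, including the row itself."""
    window = deque()
    out = []
    for t in times:
        window.append(t)
        # Evict rows that fell out of the preceding-hour range.
        while window[0] < t - span:
            window.popleft()
        out.append((t, len(window)))
    return out


def tumbling_counts(events, start, end,
                    width=timedelta(minutes=11),
                    grace=timedelta(minutes=5)):
    """Count events per tumbling window [w, w + width) between start and end.

    `events` is an iterable of (event_time, arrival_time) pairs (hypothetical
    representation). Every complete window yields one result, so empty
    windows report 0. An event is ignored if it arrives later than its
    window's end plus `grace` (assumed late-arrival rule).
    """
    n = (end - start) // width  # number of complete windows
    counts = [0] * n
    for event_time, arrival_time in events:
        if not (start <= event_time < start + n * width):
            continue  # outside the covered range
        i = (event_time - start) // width
        window_end = start + (i + 1) * width
        if arrival_time <= window_end + grace:
            counts[i] += 1
    # One (window_end, count) pair per window, in time order.
    return [(start + (i + 1) * width, c) for i, c in enumerate(counts)]
```

Note that the sliding version does work for every row, while the tumbling version produces one result per window regardless of how many rows arrive, which is the efficiency difference raised in the first question.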



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)