You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@beam.apache.org by "Kenneth Knowles (JIRA)" <ji...@apache.org> on 2018/07/02 17:49:00 UTC

[jira] [Updated] (BEAM-4702) After SQL GROUP BY the result should be globally windowed

     [ https://issues.apache.org/jira/browse/BEAM-4702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kenneth Knowles updated BEAM-4702:
----------------------------------
    Description: 
Beam SQL runs in two contexts:

1. As a PTransform in a pipeline. A PTransform operates on a PCollection, which is always implicitly windows and a PTransform should operate per-window so it automatically works on bounded and unbounded data. This only works if the query has no windowing operators, in which case the GROUP BY <non-window stuff> should operate per-window.
2. As a top-level shell that starts and ends with SQL. In the relational model there are no implicit windows. Calcite has some extensions for windowing, but they manifest (IMO correctly) as just items in the GROUP BY list. The output of the aggregation is "just rows" again. So it should be globally windowed.

The problem is that this semantic fix makes it so we cannot join windowing stream subqueries. Because we don't have retractions, we only support GroupByKey-based equijoins over windowed streams, with the default trigger. _These joins implicitly also join windows_. For example:

{code}
JOIN(left.id = right.id)
  SELECT ... GROUP BY id, TUMBLE(1 hour)
  SELECT ... GROUP BY id, TUMBLE(1 hour)  
{code}

Semantically, there may be a joined row for 1:00pm on the left and 10:00pm on the right. But by the time the right-hand row for 10:00pm shows up, the left one may be GC'd. So this is implicitly, but nondeterministically, joining on the window as well. Before this PR, we left the windowing strategies for left and right in place, and asserted that they matched.

If we re-window into the global window always, there _are no windowed streams_ so you just can't do stream joins. The solution is probably to track which field of a stream is the window and allow joins which also explicitly express the equijoin over the window field.

> After SQL GROUP BY <windowing> the result should be globally windowed
> ---------------------------------------------------------------------
>
>                 Key: BEAM-4702
>                 URL: https://issues.apache.org/jira/browse/BEAM-4702
>             Project: Beam
>          Issue Type: New Feature
>          Components: dsl-sql
>            Reporter: Kenneth Knowles
>            Assignee: Kenneth Knowles
>            Priority: Major
>          Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> Beam SQL runs in two contexts:
> 1. As a PTransform in a pipeline. A PTransform operates on a PCollection, which is always implicitly windows and a PTransform should operate per-window so it automatically works on bounded and unbounded data. This only works if the query has no windowing operators, in which case the GROUP BY <non-window stuff> should operate per-window.
> 2. As a top-level shell that starts and ends with SQL. In the relational model there are no implicit windows. Calcite has some extensions for windowing, but they manifest (IMO correctly) as just items in the GROUP BY list. The output of the aggregation is "just rows" again. So it should be globally windowed.
> The problem is that this semantic fix makes it so we cannot join windowing stream subqueries. Because we don't have retractions, we only support GroupByKey-based equijoins over windowed streams, with the default trigger. _These joins implicitly also join windows_. For example:
> {code}
> JOIN(left.id = right.id)
>   SELECT ... GROUP BY id, TUMBLE(1 hour)
>   SELECT ... GROUP BY id, TUMBLE(1 hour)  
> {code}
> Semantically, there may be a joined row for 1:00pm on the left and 10:00pm on the right. But by the time the right-hand row for 10:00pm shows up, the left one may be GC'd. So this is implicitly, but nondeterministically, joining on the window as well. Before this PR, we left the windowing strategies for left and right in place, and asserted that they matched.
> If we re-window into the global window always, there _are no windowed streams_ so you just can't do stream joins. The solution is probably to track which field of a stream is the window and allow joins which also explicitly express the equijoin over the window field.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)