Posted to github@beam.apache.org by GitBox <gi...@apache.org> on 2022/06/03 22:09:38 UTC

[GitHub] [beam] kennknowles opened a new issue, #19075: After SQL GROUP BY the result should be globally windowed

kennknowles opened a new issue, #19075:
URL: https://github.com/apache/beam/issues/19075

   Beam SQL runs in two contexts:
   
   1. As a PTransform in a pipeline. A PTransform operates on a PCollection, which is always implicitly windowed, and a PTransform should operate per-window so that it automatically works on both bounded and unbounded data. This only works if the query has no windowing operators, in which case the `GROUP BY <non-window stuff>` should operate per-window.
   2. As a top-level shell that starts and ends with SQL. In the relational model there are no implicit windows. Calcite has some extensions for windowing, but they manifest (IMO correctly) as just items in the GROUP BY list. The output of the aggregation is "just rows" again. So it should be globally windowed.
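
To make the two contexts concrete, here is a minimal sketch in plain Python (not Beam or Calcite code). The rows, the one-hour window size, and the helper names `tumble_start`, `group_per_implicit_window`, and `group_with_window_column` are all hypothetical and exist only for illustration.

```python
from collections import defaultdict

HOUR = 3600  # assumed window size in seconds, for illustration only

# Hypothetical input rows: (id, event_time_seconds, value)
rows = [
    ("a", 10, 1),
    ("a", 20, 2),
    ("a", 3700, 5),
    ("b", 30, 7),
]

def tumble_start(ts, size=HOUR):
    """Start of the fixed (tumbling) window that contains ts."""
    return ts - ts % size

def group_per_implicit_window(rows):
    """Context 1: GROUP BY id, evaluated separately inside each implicit window."""
    windows = defaultdict(lambda: defaultdict(int))
    for key, ts, value in rows:
        windows[tumble_start(ts)][key] += value
    # The window stays metadata around the rows; it is not a column in them.
    return {w: dict(sums) for w, sums in windows.items()}

def group_with_window_column(rows):
    """Context 2: GROUP BY id, TUMBLE(...); the window is an ordinary key column."""
    sums = defaultdict(int)
    for key, ts, value in rows:
        sums[(key, tumble_start(ts))] += value
    # The output is "just rows": (id, window_start, total).
    return sorted((k, w, total) for (k, w), total in sums.items())

per_window = group_per_implicit_window(rows)
flat_rows = group_with_window_column(rows)
```

In the second function the window start survives only as a value inside each output row, which is why nothing is lost semantically if those rows are placed in the global window.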
   
   The problem is that this semantic fix makes it so we cannot join windowed stream subqueries. Because we don't have retractions, we only support GroupByKey-based equijoins over windowed streams, with the default trigger. _These joins implicitly also join windows_. For example:
   
   ```
   JOIN(left.id = right.id)
     SELECT ... GROUP BY id, TUMBLE(1 hour)
     SELECT ... GROUP BY id, TUMBLE(1 hour)
   ```
   
   
   Semantically, there may be a joined row for 1:00pm on the left and 10:00pm on the right. But by the time the right-hand row for 10:00pm shows up, the left one may be GC'd. So this is implicitly, but nondeterministically, joining on the window as well. Before this PR, we left the windowing strategies for left and right in place, and asserted that they matched.
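
The effect of garbage collection on such a join can be sketched with a toy simulation in plain Python. This is not how Beam implements joins; the event tuples, the watermark-based expiry rule, and the function name `join_with_gc` are assumptions made for illustration.

```python
HOUR = 3600  # assumed window size in seconds, for illustration only

def join_with_gc(events, gc_enabled):
    """Equijoin on id over buffered rows.

    events: (side, id, window_start, watermark_at_arrival), in arrival order.
    Returns a list of (id, left_window_start, right_window_start).
    """
    state = {"left": set(), "right": set()}  # side -> {(id, window_start)}
    out = []
    for side, key, window_start, watermark in events:
        if gc_enabled:
            # Drop buffered rows whose window ended at or before the watermark.
            for s in state:
                state[s] = {(k, w) for (k, w) in state[s] if w + HOUR > watermark}
        other = "right" if side == "left" else "left"
        for other_key, other_window in sorted(state[other]):
            if other_key == key:
                if side == "left":
                    out.append((key, window_start, other_window))
                else:
                    out.append((key, other_window, window_start))
        state[side].add((key, window_start))
    return out

events = [
    ("left", 1, 13 * HOUR, 13 * HOUR),        # left row for the 1:00pm window
    ("right", 1, 13 * HOUR, 13 * HOUR + 60),  # right row for the same window
    ("right", 1, 22 * HOUR, 22 * HOUR),       # right row for the 10:00pm window
]

without_gc = join_with_gc(events, gc_enabled=False)
with_gc = join_with_gc(events, gc_enabled=True)
```

Without expiry the 1:00pm left row pairs with the 10:00pm right row; with expiry only the same-window pair survives, so the result looks like a join on `id` and the window together. In a real runner the expiry timing is not fixed like this, which is where the nondeterminism comes from.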
   
   If we re-window into the global window always, there _are no windowed streams_ so you just can't do stream joins. The solution is probably to track which field of a stream is the window and allow joins which also explicitly express the equijoin over the window field.
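
As a rough sketch of what an explicit equijoin over the window field could mean, the plain-Python hash join below treats the window start as an ordinary column and joins on `(id, window_start)`. The row layout and the name `join_on_id_and_window` are hypothetical; this is not a proposal for the actual Beam SQL implementation.

```python
def join_on_id_and_window(left_rows, right_rows):
    """Hash join on (id, window_start); rows are (id, window_start, value)."""
    index = {}
    for key, window_start, value in right_rows:
        index.setdefault((key, window_start), []).append(value)
    return [
        (key, window_start, left_value, right_value)
        for key, window_start, left_value in left_rows
        for right_value in index.get((key, window_start), [])
    ]

# Hypothetical globally windowed rows that carry their window as a field.
left = [(1, 46800, "L13"), (1, 79200, "L22")]
right = [(1, 46800, "R13"), (2, 46800, "R13b")]

joined = join_on_id_and_window(left, right)
```

Because the window equality is part of the join condition, the result no longer depends on when buffered state happens to be dropped.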
   
   Imported from Jira [BEAM-4702](https://issues.apache.org/jira/browse/BEAM-4702). Original Jira may contain additional context.
   Reported by: kenn.

