Posted to github@beam.apache.org by GitBox <gi...@apache.org> on 2021/11/09 04:06:45 UTC

[GitHub] [beam] melap commented on a change in pull request #15778: [BEAM-11758] Update basics page: Window, Watermark

melap commented on a change in pull request #15778:
URL: https://github.com/apache/beam/pull/15778#discussion_r745222397



##########
File path: website/www/site/content/en/documentation/basics.md
##########
@@ -365,6 +373,76 @@ For more information about runners, see the following pages:
  * [Choosing a Runner](/documentation/#choosing-a-runner)
  * [Beam Capability Matrix](/documentation/runners/capability-matrix/)
 
+### Window
+
+Windowing subdivides a `PCollection` into _windows_ according to the timestamps
+of its individual elements. Windows enable grouping operations over unbounded
+collections by dividing the collection into windows of finite collections. A
+windowing function tells the runner how to assign elements to an initial window,
+and how to merge windows of grouped elements. Two concepts are closely related
+to windowing: [watermarks](#watermark) and triggers.
+
+Transforms that aggregate multiple elements, such as `GroupByKey` and `Combine`,

Review comment:
       👍

##########
File path: website/www/site/content/en/documentation/basics.md
##########
@@ -42,6 +42,14 @@ understand an important set of core concepts:
    them to a runner.
  * [_Runner_](#runner) - A runner runs a Beam pipeline using the capabilities of
    your chosen data processing engine.
+ * [_Window_](#window) - A `PCollection` can be subdivided into windows based on

Review comment:
       Yeah, I've tried to keep things very simple here and let the programming guide cover the gory details. I'll leave it for now unless there are objections; we can always tweak it later if it causes confusion.

##########
File path: website/www/site/content/en/documentation/basics.md
##########
@@ -365,6 +373,76 @@ For more information about runners, see the following pages:
  * [Choosing a Runner](/documentation/#choosing-a-runner)
  * [Beam Capability Matrix](/documentation/runners/capability-matrix/)
 
+### Window
+
+Windowing subdivides a `PCollection` into _windows_ according to the timestamps
+of its individual elements. Windows enable grouping operations over unbounded
+collections by dividing the collection into windows of finite collections. A
+windowing function tells the runner how to assign elements to an initial window,
+and how to merge windows of grouped elements. Two concepts are closely related
+to windowing: [watermarks](#watermark) and triggers.
+
+Transforms that aggregate multiple elements, such as `GroupByKey` and `Combine`,
+work implicitly on a per-window basis; they process each `PCollection` as a
+succession of multiple, finite windows, though the entire collection itself may
+be of unbounded size.
+
+Beam provides several windowing functions:
+
+ * **Fixed time windows** (also known as "tumbling windows") represent a consistent
+   duration, non overlapping time interval in the data stream.
+ * **Sliding time windows** (also known as "hopping windows") also represent time
+   intervals in the data stream; however, sliding time windows can overlap.
+ * **Per-session windows** define windows that contain elements that are within a
+   certain gap duration of another element.
+ * **Single global window**: by default, all data in a `PCollection` is assigned to
+   the single global window, and late data is discarded.
+ * **Calendar-based windows** (not supported by the Beam SDK for Python)
+
+You can also define your own windowing function if you have more complex
+requirements.
+
+For more information about windows, see the following page:
+
+ * [Beam Programming Guide: Windowing](/documentation/programming-guide/#windowing)
+
+### Watermark
+
+In any data processing system, there is a certain amount of lag between the time
+a data event occurs (the “event time”, determined by the timestamp on the data
+element itself) and the time the actual data element gets processed at any stage
+in your pipeline (the “processing time”, determined by the clock on the system
+processing the element). In addition, there are no guarantees that data events

Review comment:
       Thanks, I rearranged this a bit. Added your example suggestions, moved the windowing example up to the windowing section (not sure how it got down here), and moved the "isn't always guaranteed" sentence up to the intro paragraph.

##########
File path: website/www/site/content/en/documentation/basics.md
##########
@@ -365,6 +373,76 @@ For more information about runners, see the following pages:
  * [Choosing a Runner](/documentation/#choosing-a-runner)
  * [Beam Capability Matrix](/documentation/runners/capability-matrix/)
 
+### Window
+
+Windowing subdivides a `PCollection` into _windows_ according to the timestamps
+of its individual elements. Windows enable grouping operations over unbounded
+collections by dividing the collection into windows of finite collections. A
+windowing function tells the runner how to assign elements to an initial window,

Review comment:
       This brings up a general question I've been looking at regarding elements in multiple windows. The docs seem to have (on the surface at least) contradictory statements on how many windows an element can be in.
   
   From the existing section at https://beam.apache.org/documentation/basics/#windowed-elements :
   **No element resides in multiple windows**; two elements can be equal except for their window, but they are not the same.
   
   From https://beam.apache.org/documentation/programming-guide/#windowing-basics :
   Each element in a PCollection is **assigned to one or more windows** according to the PCollection's windowing function
   
   From https://beam.apache.org/documentation/programming-guide/#sliding-time-windows :
   Because multiple windows overlap, most elements in a data set will belong to **more than one window**. 
   
   This suggestion is another in the "one or more" column.
   
   However, I've also heard that an element that falls into two different windows is actually considered two separate elements.
   
   What's the most accurate explanation here? Do you have a suggestion as to which way to document this?
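
To make the two readings concrete, here is a minimal sketch in plain Python. It is not Beam code and makes no claim about how any runner implements this: `sliding_windows` and `explode` are hypothetical helper names, and windows are modeled as half-open `[start, end)` intervals in seconds. It shows how one input element can be "assigned to more than one window" and also be treated as one separate windowed element per window.

```python
from typing import List, Tuple

# Assumption for this sketch: a window is a half-open interval [start, end) in seconds.
Window = Tuple[int, int]


def sliding_windows(timestamp: int, size: int, period: int) -> List[Window]:
    """Return every sliding window [start, start + size) that contains timestamp."""
    # Latest window start that is at or before the timestamp.
    last_start = timestamp - (timestamp % period)
    # Step backwards by one period while the window still covers the timestamp.
    starts = range(last_start, timestamp - size, -period)
    return [(start, start + size) for start in starts]


def explode(value: str, timestamp: int, size: int, period: int) -> List[Tuple[str, Window]]:
    """Pair the value with each of its windows: one windowed element per window."""
    return [(value, window) for window in sliding_windows(timestamp, size, period)]


# One input element at t=65s, with 60s windows starting every 30s.
for value, window in explode("click", timestamp=65, size=60, period=30):
    print(value, window)
```

With these inputs the loop prints `click (60, 120)` and `click (30, 90)`: the single input lands in two windows, and after `explode` it is represented as two windowed elements that are equal except for their window. If that reading is accurate, "assigned to one or more windows" and "no element resides in multiple windows" describe the same thing before and after that step.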
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
