You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@beam.apache.org by GitBox <gi...@apache.org> on 2021/11/08 19:44:45 UTC

[GitHub] [beam] kennknowles commented on a change in pull request #15778: [BEAM-11758] Update basics page: Window, Watermark

kennknowles commented on a change in pull request #15778:
URL: https://github.com/apache/beam/pull/15778#discussion_r745036962



##########
File path: website/www/site/content/en/documentation/basics.md
##########
@@ -365,6 +373,76 @@ For more information about runners, see the following pages:
  * [Choosing a Runner](/documentation/#choosing-a-runner)
  * [Beam Capability Matrix](/documentation/runners/capability-matrix/)
 
+### Window
+
+Windowing subdivides a `PCollection` into _windows_ according to the timestamps
+of its individual elements. Windows enable grouping operations over unbounded
+collections by dividing the collection into windows of finite collections. A
+windowing function tells the runner how to assign elements to an initial window,

Review comment:
       one or more initial windows (maybe even zero is possible but that is wonky)

##########
File path: website/www/site/content/en/documentation/basics.md
##########
@@ -365,6 +373,76 @@ For more information about runners, see the following pages:
  * [Choosing a Runner](/documentation/#choosing-a-runner)
  * [Beam Capability Matrix](/documentation/runners/capability-matrix/)
 
+### Window
+
+Windowing subdivides a `PCollection` into _windows_ according to the timestamps
+of its individual elements. Windows enable grouping operations over unbounded
+collections by dividing the collection into windows of finite collections. A
+windowing function tells the runner how to assign elements to an initial window,
+and how to merge windows of grouped elements. Two concepts are closely related
+to windowing: [watermarks](#watermark) and triggers.
+
+Transforms that aggregate multiple elements, such as `GroupByKey` and `Combine`,
+work implicitly on a per-window basis; they process each `PCollection` as a
+succession of multiple, finite windows, though the entire collection itself may
+be of unbounded size.
+
+Beam provides several windowing functions:
+
+ * **Fixed time windows** (also known as "tumbling windows") represent a consistent
+   duration, non overlapping time interval in the data stream.
+ * **Sliding time windows** (also known as "hopping windows") also represent time
+   intervals in the data stream; however, sliding time windows can overlap.
+ * **Per-session windows** define windows that contain elements that are within a
+   certain gap duration of another element.
+ * **Single global window**: by default, all data in a `PCollection` is assigned to
+   the single global window, and late data is discarded.
+ * **Calendar-based windows** (not supported by the Beam SDK for Python)
+
+You can also define your own windowing function if you have more complex
+requirements.
+
+For more information about windows, see the following page:
+
+ * [Beam Programming Guide: Windowing](/documentation/programming-guide/#windowing)
+
+### Watermark
+
+In any data processing system, there is a certain amount of lag between the time
+a data event occurs (the “event time”, determined by the timestamp on the data
+element itself) and the time the actual data element gets processed at any stage
+in your pipeline (the “processing time”, determined by the clock on the system
+processing the element). In addition, there are no guarantees that data events
+will appear in your pipeline in the same order that they were generated.
+
+For example, let’s say we have a PCollection that’s using fixed-time windowing,
+with windows that are five minutes long. For each window, Beam must collect all
+the data with an event time timestamp in the given window range (between 0:00
+and 4:59 in the first window, for instance). Data with timestamps outside that
+range (data from 5:00 or later) belong to a different window.
+
+However, data isn’t always guaranteed to arrive in a pipeline in time order, or

Review comment:
       down here you say "isn't always guaranteed" which seems a good phrasing

##########
File path: website/www/site/content/en/documentation/basics.md
##########
@@ -365,6 +373,76 @@ For more information about runners, see the following pages:
  * [Choosing a Runner](/documentation/#choosing-a-runner)
  * [Beam Capability Matrix](/documentation/runners/capability-matrix/)
 
+### Window
+
+Windowing subdivides a `PCollection` into _windows_ according to the timestamps
+of its individual elements. Windows enable grouping operations over unbounded
+collections by dividing the collection into windows of finite collections. A
+windowing function tells the runner how to assign elements to an initial window,
+and how to merge windows of grouped elements. Two concepts are closely related
+to windowing: [watermarks](#watermark) and triggers.
+
+Transforms that aggregate multiple elements, such as `GroupByKey` and `Combine`,
+work implicitly on a per-window basis; they process each `PCollection` as a
+succession of multiple, finite windows, though the entire collection itself may
+be of unbounded size.
+
+Beam provides several windowing functions:
+
+ * **Fixed time windows** (also known as "tumbling windows") represent a consistent
+   duration, non overlapping time interval in the data stream.
+ * **Sliding time windows** (also known as "hopping windows") also represent time
+   intervals in the data stream; however, sliding time windows can overlap.
+ * **Per-session windows** define windows that contain elements that are within a
+   certain gap duration of another element.
+ * **Single global window**: by default, all data in a `PCollection` is assigned to
+   the single global window, and late data is discarded.
+ * **Calendar-based windows** (not supported by the Beam SDK for Python)
+
+You can also define your own windowing function if you have more complex
+requirements.
+
+For more information about windows, see the following page:
+
+ * [Beam Programming Guide: Windowing](/documentation/programming-guide/#windowing)
+
+### Watermark
+
+In any data processing system, there is a certain amount of lag between the time
+a data event occurs (the “event time”, determined by the timestamp on the data
+element itself) and the time the actual data element gets processed at any stage
+in your pipeline (the “processing time”, determined by the clock on the system
+processing the element). In addition, there are no guarantees that data events

Review comment:
       The guarantees depend on the entirety of the system - sometimes there might be. Maybe "in many cases, data events can appear in your pipeline in a different order than they were generated". Could also be worth mentioning examples like intermediate systems that don't preserve order but also that data might come from different sources. If two independent servers are timestamping things and one has a better connection they won't be merged into an in-order stream.

##########
File path: website/www/site/content/en/documentation/basics.md
##########
@@ -42,6 +42,14 @@ understand an important set of core concepts:
    them to a runner.
  * [_Runner_](#runner) - A runner runs a Beam pipeline using the capabilities of
    your chosen data processing engine.
+ * [_Window_](#window) - A `PCollection` can be subdivided into windows based on

Review comment:
       Technically not always "subdivided" per se. But TBH close enough, if you consider sliding windows to duplicate elements and then subdivide that result of that. I don't know how pedantic you want to be. I started writing this comment thinking we should change it but now I think keeping it as-is might be the best concise way of helping someone understand what windows are and the purpose they serve.

##########
File path: website/www/site/content/en/documentation/basics.md
##########
@@ -365,6 +373,76 @@ For more information about runners, see the following pages:
  * [Choosing a Runner](/documentation/#choosing-a-runner)
  * [Beam Capability Matrix](/documentation/runners/capability-matrix/)
 
+### Window
+
+Windowing subdivides a `PCollection` into _windows_ according to the timestamps
+of its individual elements. Windows enable grouping operations over unbounded
+collections by dividing the collection into windows of finite collections. A
+windowing function tells the runner how to assign elements to an initial window,
+and how to merge windows of grouped elements. Two concepts are closely related
+to windowing: [watermarks](#watermark) and triggers.
+
+Transforms that aggregate multiple elements, such as `GroupByKey` and `Combine`,

Review comment:
       This is really really clear. Nice

##########
File path: website/www/site/content/en/documentation/basics.md
##########
@@ -365,6 +373,76 @@ For more information about runners, see the following pages:
  * [Choosing a Runner](/documentation/#choosing-a-runner)
  * [Beam Capability Matrix](/documentation/runners/capability-matrix/)
 
+### Window
+
+Windowing subdivides a `PCollection` into _windows_ according to the timestamps
+of its individual elements. Windows enable grouping operations over unbounded
+collections by dividing the collection into windows of finite collections. A
+windowing function tells the runner how to assign elements to an initial window,
+and how to merge windows of grouped elements. Two concepts are closely related
+to windowing: [watermarks](#watermark) and triggers.
+
+Transforms that aggregate multiple elements, such as `GroupByKey` and `Combine`,
+work implicitly on a per-window basis; they process each `PCollection` as a
+succession of multiple, finite windows, though the entire collection itself may
+be of unbounded size.
+
+Beam provides several windowing functions:
+
+ * **Fixed time windows** (also known as "tumbling windows") represent a consistent
+   duration, non overlapping time interval in the data stream.

Review comment:
       non-overlapping? (hyphen)




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org