You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@beam.apache.org by GitBox <gi...@apache.org> on 2021/11/08 19:50:10 UTC

[GitHub] [beam] kennknowles commented on a change in pull request #15780: [BEAM-11758] Update basics page: Trigger, State and timers

kennknowles commented on a change in pull request #15780:
URL: https://github.com/apache/beam/pull/15780#discussion_r745041348



##########
File path: website/www/site/content/en/documentation/basics.md
##########
@@ -365,6 +369,108 @@ For more information about runners, see the following pages:
  * [Choosing a Runner](/documentation/#choosing-a-runner)
  * [Beam Capability Matrix](/documentation/runners/capability-matrix/)
 
+### Trigger
+
+When collecting and grouping data into windows, Beam uses _triggers_ to
+determine when to emit the aggregated results of each window (referred to as a
+_pane_). If you use Beam’s default windowing configuration and default trigger,
+Beam outputs the aggregated result when it estimates all data has arrived, and
+discards all subsequent data for that window.
+
+At a high level, triggers provide two additional capabilities compared to
+outputting at the end of a window:
+
+ 1. Triggers allow Beam to emit early results, before all the data in a given
+    window has arrived. For example, emitting after a certain amount of time
+    elapses, or after a certain number of elements arrives.
+ 2. Triggers allow processing of late data by triggering after the event time
+    watermark passes the end of the window.
+
+These capabilities allow you to control the flow of your data and also balance
+between data completeness, latency, and cost.
+
+Beam provides a number of pre-built triggers that you can set:
+
+ * **Event time triggers**: These triggers operate on the event time, as
+   indicated by the timestamp on each data element. Beam’s default trigger is
+   event time-based.
+ * **Processing time triggers**: These triggers operate on the processing time,
+   which is the time when the data element is processed at any given stage in
+   the pipeline.
+ * **Data-driven triggers**: These triggers operate by examining the data as it
+   arrives in each window, and firing when that data meets a certain property.
+   Currently, data-driven triggers only support firing after a certain number of
+   data elements.
+ * **Composite triggers**: These triggers combine multiple triggers in various

Review comment:
       Are you doing inline examples here? Like "for example, to one trigger for early data and a different trigger for late data"

##########
File path: website/www/site/content/en/documentation/basics.md
##########
@@ -365,6 +369,108 @@ For more information about runners, see the following pages:
  * [Choosing a Runner](/documentation/#choosing-a-runner)
  * [Beam Capability Matrix](/documentation/runners/capability-matrix/)
 
+### Trigger
+
+When collecting and grouping data into windows, Beam uses _triggers_ to
+determine when to emit the aggregated results of each window (referred to as a
+_pane_). If you use Beam’s default windowing configuration and default trigger,
+Beam outputs the aggregated result when it estimates all data has arrived, and
+discards all subsequent data for that window.
+
+At a high level, triggers provide two additional capabilities compared to
+outputting at the end of a window:
+
+ 1. Triggers allow Beam to emit early results, before all the data in a given
+    window has arrived. For example, emitting after a certain amount of time
+    elapses, or after a certain number of elements arrives.
+ 2. Triggers allow processing of late data by triggering after the event time
+    watermark passes the end of the window.
+
+These capabilities allow you to control the flow of your data and also balance
+between data completeness, latency, and cost.
+
+Beam provides a number of pre-built triggers that you can set:
+
+ * **Event time triggers**: These triggers operate on the event time, as
+   indicated by the timestamp on each data element. Beam’s default trigger is
+   event time-based.
+ * **Processing time triggers**: These triggers operate on the processing time,
+   which is the time when the data element is processed at any given stage in
+   the pipeline.
+ * **Data-driven triggers**: These triggers operate by examining the data as it
+   arrives in each window, and firing when that data meets a certain property.
+   Currently, data-driven triggers only support firing after a certain number of
+   data elements.
+ * **Composite triggers**: These triggers combine multiple triggers in various
+   ways.
+
+For more information about triggers, see the following page:
+
+ * [Beam Programming Guide: Triggers](/documentation/programming-guide/#triggers)
+
+### State and timers
+
+Beam’s windowing and triggers provide an abstraction for grouping and
+aggregating unbounded input data based on timestamps. However, there are
+aggregation use cases that might require an even higher degree of control. State
+and timers are two important concepts that help with these uses cases.
+
+**State**:
+
+Beam provides the State API for manually managing per-key state, allowing for
+fine-grained control over aggregations.  The State API lets you augment
+element-wise operations (for example, `ParDo` or `Map`) with mutable state.
+
+The State API models state per key. To use the state API, you start out with a
+keyed `PCollection`. A `ParDo` that processes this `PCollection` can declare
+persistent state variables. When you process each element inside the `ParDo`,
+you can use the state variables to write or update state for the current key or
+to read previous state written for that key. State is always fully scoped only
+to the current processing key.
+
+Beam provides several types of state:
+
+ * **ValueState**: A ValueState is a scalar state value. For each key in the
+   input, a ValueState stores a typed value that can be read and modified inside
+   the `DoFn`.
+ * **CombiningState**: CombiningState allows you to create a state object that is
+   updated using a Beam combiner.

Review comment:
       If you put this after BagState, you can say "like BagState, you can add elements to an aggregation without having to read the current value, and the accumulator can be compacted using a combiner".

##########
File path: website/www/site/content/en/documentation/basics.md
##########
@@ -365,6 +369,108 @@ For more information about runners, see the following pages:
  * [Choosing a Runner](/documentation/#choosing-a-runner)
  * [Beam Capability Matrix](/documentation/runners/capability-matrix/)
 
+### Trigger
+
+When collecting and grouping data into windows, Beam uses _triggers_ to
+determine when to emit the aggregated results of each window (referred to as a
+_pane_). If you use Beam’s default windowing configuration and default trigger,
+Beam outputs the aggregated result when it estimates all data has arrived, and
+discards all subsequent data for that window.
+
+At a high level, triggers provide two additional capabilities compared to
+outputting at the end of a window:
+
+ 1. Triggers allow Beam to emit early results, before all the data in a given
+    window has arrived. For example, emitting after a certain amount of time
+    elapses, or after a certain number of elements arrives.
+ 2. Triggers allow processing of late data by triggering after the event time
+    watermark passes the end of the window.
+
+These capabilities allow you to control the flow of your data and also balance
+between data completeness, latency, and cost.
+
+Beam provides a number of pre-built triggers that you can set:
+
+ * **Event time triggers**: These triggers operate on the event time, as
+   indicated by the timestamp on each data element. Beam’s default trigger is
+   event time-based.
+ * **Processing time triggers**: These triggers operate on the processing time,
+   which is the time when the data element is processed at any given stage in
+   the pipeline.
+ * **Data-driven triggers**: These triggers operate by examining the data as it
+   arrives in each window, and firing when that data meets a certain property.
+   Currently, data-driven triggers only support firing after a certain number of
+   data elements.
+ * **Composite triggers**: These triggers combine multiple triggers in various
+   ways.
+
+For more information about triggers, see the following page:
+
+ * [Beam Programming Guide: Triggers](/documentation/programming-guide/#triggers)
+
+### State and timers
+
+Beam’s windowing and triggers provide an abstraction for grouping and
+aggregating unbounded input data based on timestamps. However, there are
+aggregation use cases that might require an even higher degree of control. State
+and timers are two important concepts that help with these uses cases.
+
+**State**:
+
+Beam provides the State API for manually managing per-key state, allowing for
+fine-grained control over aggregations.  The State API lets you augment
+element-wise operations (for example, `ParDo` or `Map`) with mutable state.
+
+The State API models state per key. To use the state API, you start out with a
+keyed `PCollection`. A `ParDo` that processes this `PCollection` can declare
+persistent state variables. When you process each element inside the `ParDo`,
+you can use the state variables to write or update state for the current key or
+to read previous state written for that key. State is always fully scoped only
+to the current processing key.
+
+Beam provides several types of state:

Review comment:
       Now we've got MapState and SetState and even OrderedListState. (of course documenting some is better than none!)

##########
File path: website/www/site/content/en/documentation/basics.md
##########
@@ -365,6 +369,108 @@ For more information about runners, see the following pages:
  * [Choosing a Runner](/documentation/#choosing-a-runner)
  * [Beam Capability Matrix](/documentation/runners/capability-matrix/)
 
+### Trigger
+
+When collecting and grouping data into windows, Beam uses _triggers_ to
+determine when to emit the aggregated results of each window (referred to as a
+_pane_). If you use Beam’s default windowing configuration and default trigger,
+Beam outputs the aggregated result when it estimates all data has arrived, and
+discards all subsequent data for that window.
+
+At a high level, triggers provide two additional capabilities compared to
+outputting at the end of a window:
+
+ 1. Triggers allow Beam to emit early results, before all the data in a given
+    window has arrived. For example, emitting after a certain amount of time
+    elapses, or after a certain number of elements arrives.
+ 2. Triggers allow processing of late data by triggering after the event time
+    watermark passes the end of the window.
+
+These capabilities allow you to control the flow of your data and also balance
+between data completeness, latency, and cost.
+
+Beam provides a number of pre-built triggers that you can set:
+
+ * **Event time triggers**: These triggers operate on the event time, as
+   indicated by the timestamp on each data element. Beam’s default trigger is
+   event time-based.
+ * **Processing time triggers**: These triggers operate on the processing time,
+   which is the time when the data element is processed at any given stage in
+   the pipeline.
+ * **Data-driven triggers**: These triggers operate by examining the data as it
+   arrives in each window, and firing when that data meets a certain property.
+   Currently, data-driven triggers only support firing after a certain number of
+   data elements.
+ * **Composite triggers**: These triggers combine multiple triggers in various
+   ways.
+
+For more information about triggers, see the following page:
+
+ * [Beam Programming Guide: Triggers](/documentation/programming-guide/#triggers)
+
+### State and timers
+
+Beam’s windowing and triggers provide an abstraction for grouping and
+aggregating unbounded input data based on timestamps. However, there are
+aggregation use cases that might require an even higher degree of control. State
+and timers are two important concepts that help with these uses cases.
+
+**State**:
+
+Beam provides the State API for manually managing per-key state, allowing for
+fine-grained control over aggregations.  The State API lets you augment
+element-wise operations (for example, `ParDo` or `Map`) with mutable state.
+
+The State API models state per key. To use the state API, you start out with a
+keyed `PCollection`. A `ParDo` that processes this `PCollection` can declare
+persistent state variables. When you process each element inside the `ParDo`,
+you can use the state variables to write or update state for the current key or
+to read previous state written for that key. State is always fully scoped only
+to the current processing key.
+
+Beam provides several types of state:
+
+ * **ValueState**: A ValueState is a scalar state value. For each key in the
+   input, a ValueState stores a typed value that can be read and modified inside
+   the `DoFn`.
+ * **CombiningState**: CombiningState allows you to create a state object that is
+   updated using a Beam combiner.
+ * **BagState**: A common use case for state is to accumulate multiple elements.
+   BagState allows you to accumulate an unordered set of elements. This lets you
+   add elements to the collection without needing to read the entire collection

Review comment:
       Specifically, you don't need to read _any_ of the previously accumulated elements.

##########
File path: website/www/site/content/en/documentation/basics.md
##########
@@ -365,6 +369,108 @@ For more information about runners, see the following pages:
  * [Choosing a Runner](/documentation/#choosing-a-runner)
  * [Beam Capability Matrix](/documentation/runners/capability-matrix/)
 
+### Trigger
+
+When collecting and grouping data into windows, Beam uses _triggers_ to
+determine when to emit the aggregated results of each window (referred to as a
+_pane_). If you use Beam’s default windowing configuration and default trigger,
+Beam outputs the aggregated result when it estimates all data has arrived, and
+discards all subsequent data for that window.
+
+At a high level, triggers provide two additional capabilities compared to
+outputting at the end of a window:
+
+ 1. Triggers allow Beam to emit early results, before all the data in a given
+    window has arrived. For example, emitting after a certain amount of time
+    elapses, or after a certain number of elements arrives.
+ 2. Triggers allow processing of late data by triggering after the event time
+    watermark passes the end of the window.
+
+These capabilities allow you to control the flow of your data and also balance
+between data completeness, latency, and cost.
+
+Beam provides a number of pre-built triggers that you can set:
+
+ * **Event time triggers**: These triggers operate on the event time, as
+   indicated by the timestamp on each data element. Beam’s default trigger is
+   event time-based.
+ * **Processing time triggers**: These triggers operate on the processing time,
+   which is the time when the data element is processed at any given stage in
+   the pipeline.
+ * **Data-driven triggers**: These triggers operate by examining the data as it
+   arrives in each window, and firing when that data meets a certain property.
+   Currently, data-driven triggers only support firing after a certain number of
+   data elements.
+ * **Composite triggers**: These triggers combine multiple triggers in various
+   ways.
+
+For more information about triggers, see the following page:
+
+ * [Beam Programming Guide: Triggers](/documentation/programming-guide/#triggers)
+
+### State and timers
+
+Beam’s windowing and triggers provide an abstraction for grouping and
+aggregating unbounded input data based on timestamps. However, there are
+aggregation use cases that might require an even higher degree of control. State
+and timers are two important concepts that help with these uses cases.
+
+**State**:
+
+Beam provides the State API for manually managing per-key state, allowing for

Review comment:
       State is also per window. Like you described for combine and groupbykey, it is implicit. "Like other aggregations, state and timers are processed per window"




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org