Posted to github@beam.apache.org by GitBox <gi...@apache.org> on 2021/10/20 21:58:10 UTC

[GitHub] [beam] TheNeuralBit commented on a change in pull request #15720: [BEAM-11758] Update basics page: Pipeline, PCollection, PTransform

TheNeuralBit commented on a change in pull request #15720:
URL: https://github.com/apache/beam/pull/15720#discussion_r733171952



##########
File path: website/www/site/content/en/documentation/basics.md
##########
@@ -40,102 +42,171 @@ transforms, there are some special features worth highlighting.
 
 ### Pipeline
 
-A pipeline in Beam is a graph of PTransforms operating on PCollections. A
-pipeline is constructed by a user in their SDK of choice, and makes its way to
-your runner either via the SDK directly or via the Runner API's
-RPC interfaces.
+A Beam pipeline is a directed acyclic graph of all the data and computations in

Review comment:
       nit: the fact that it's a DAG might be more than most readers care about, but it's an interesting side note
   
   ```suggestion
   A Beam pipeline is a graph (technically, a [directed acyclic graph](https://en.wikipedia.org/wiki/Directed_acyclic_graph)) of all the data and computations in
   ```
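
As an illustrative aside on the DAG point (a hypothetical sketch, not Beam's API): being acyclic is exactly what guarantees a topological order exists, so a runner can schedule every step exactly once. The node names below are made up to match the branching diagram in the diff.

```python
from collections import deque

# Hypothetical pipeline graph as adjacency lists (not Beam objects).
# One read step feeds two transforms, matching the branching diagram.
graph = {
    "ReadLines": ["ExtractA", "ExtractB"],  # one PCollection consumed by two PTransforms
    "ExtractA": ["WriteA"],
    "ExtractB": ["WriteB"],
    "WriteA": [],
    "WriteB": [],
}

def topological_order(graph):
    """Kahn's algorithm: succeeds only if the graph is acyclic."""
    indegree = {node: 0 for node in graph}
    for successors in graph.values():
        for s in successors:
            indegree[s] += 1
    ready = deque(n for n, d in indegree.items() if d == 0)
    order = []
    while ready:
        node = ready.popleft()
        order.append(node)
        for s in graph[node]:
            indegree[s] -= 1
            if indegree[s] == 0:
                ready.append(s)
    if len(order) != len(graph):
        raise ValueError("cycle detected: not a valid pipeline DAG")
    return order
```

A runner would execute steps in an order like the one this returns: the read first, each write only after its upstream transform.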

##########
File path: website/www/site/content/en/documentation/basics.md
##########
@@ -40,102 +42,171 @@ transforms, there are some special features worth highlighting.
 
 ### Pipeline
 
-A pipeline in Beam is a graph of PTransforms operating on PCollections. A
-pipeline is constructed by a user in their SDK of choice, and makes its way to
-your runner either via the SDK directly or via the Runner API's
-RPC interfaces.
+A Beam pipeline is a directed acyclic graph of all the data and computations in

Review comment:
       Also, do you think it would be valuable to foreshadow that data == PCollection and computations == PTransforms and link to those sections?

##########
File path: website/www/site/content/en/documentation/basics.md
##########
@@ -40,102 +42,171 @@ transforms, there are some special features worth highlighting.
 
 ### Pipeline
 
-A pipeline in Beam is a graph of PTransforms operating on PCollections. A
-pipeline is constructed by a user in their SDK of choice, and makes its way to
-your runner either via the SDK directly or via the Runner API's
-RPC interfaces.
+A Beam pipeline is a directed acyclic graph of all the data and computations in
+your data processing task. This includes reading input data, transforming that
+data, and writing output data. A pipeline is constructed by a user in their SDK
+of choice. Then, the pipeline makes its way to the runner either through the SDK
+directly or through the Runner API's RPC interface. For example, this diagram
+shows a branching pipeline:
+
+![The pipeline applies two transforms to a single input collection. Each
+  transform produces an output collection.](/images/design-your-pipeline-multiple-pcollections.svg)
+
+In the diagram, the boxes are parallel transformations called _PTransforms_ and
+the arrows with the circles represent the data (in the form of _PCollections_)
+that flows between the transforms. The data might be bounded, stored, data sets,
+or the data might also be unbounded streams of data. In Beam, most transforms
+apply equally to bounded and unbounded data.
+
+You can express almost any computation that you can think of as a graph as a
+Beam pipeline. A Beam driver program typically starts by creating a `Pipeline`
+object, and then uses that object as the basis for creating the pipeline’s data
+sets and its transforms.
+
+For more information about pipelines, see the following pages:
+
+ * [Beam Programming Guide: Overview](/documentation/programming-guide/#overview)
+ * [Beam Programming Guide: Creating a pipeline](/documentation/programming-guide/#creating-a-pipeline)
+ * [Design your pipeline](/documentation/pipelines/design-your-pipeline)
+ * [Create your pipeline](/documentation/pipeline/create-your-pipeline)
 
 ### PTransforms
 
-A `PTransform` represents a data processing operation, or a step,
-in your pipeline. A `PTransform` can be applied to one or more
-`PCollection` objects as input which performs some processing on the elements of that
-`PCollection` and produces zero or more output `PCollection` objects.
+A `PTransform` (or transform) represents a data processing operation, or a step,
+in your pipeline. A transform can be applied to one or more input `PCollection`
+objects. You provide processing logic in the form of a function object
+(colloquially referred to as “user code”), and your user code is applied to each
+element of the input PCollection (or more than one PCollection).  Depending on
+the pipeline runner and backend that you choose, many different workers across a
+cluster might execute instances of your user code in parallel.  The user code
+that runs on each worker generates the output elements that are added to the
+zero or more output `PCollection` objects.
+
+The Beam SDKs contain a number of different transforms that you can apply to
+your pipeline’s PCollections. These include general-purpose core transforms,
+such as `ParDo` or `Combine`. There are also pre-written composite transforms
+included in the SDKs, which combine one or more of the core transforms in a
+useful processing pattern, such as counting or combining elements in a
+collection. You can also define your own more complex composite transforms to
+fit your pipeline’s exact use case.
+
+The following list has some common transform types:
+
+ * Root transforms such as `TextIO.Read` and `Create`. A root transform
+   conceptually has no input.
+ * Processing and conversion operations such as `ParDo`, `GroupByKey`,
+   `CoGroupByKey`, `Combine`, and `Count`.
+ * Outputting transforms like `TextIO.Write`.
+ * User-defined, application-specific composite transforms.
+
+For more information about transforms, see the following pages:
+
+ * [Beam Programming Guide: Overview](/documentation/programming-guide/#overview)
+ * [Beam Programming Guide: Transforms](/documentation/programming-guide/#transforms)
+ * Beam transform catalog ([Java](/documentation/transforms/java/overview/),
+   [Python](/documentation/transforms/python/overview/))
 
 ### PCollections
 
-A PCollection is an unordered bag of elements. Your runner will be responsible
-for storing these elements.  There are some major aspects of a PCollection to
-note:
+Beam pipelines process PCollections. A `PCollection` is a potentially
+distributed, homogeneous data set or data stream, and is owned by the specific
+`Pipeline` object for which it is created. Multiple pipelines cannot share a
+`PCollection`. The runner is responsible for storing these elements.
+
+A PCollection generally contains "big data" (too much data to fit in memory on a
+single machine). Sometimes a small sample of data or an intermediate result
+might fit into memory on a single machine, but Beam's computational patterns and
+transforms are focused on situations where distributed data-parallel computation
+is required. Therefore, the elements of a `PCollection` cannot be processed
+individually, and are instead processed uniformly in parallel.
 
-#### Bounded vs Unbounded
+There are some major aspects of a PCollection to note:
 
-A PCollection may be bounded or unbounded.
+####  Bounded vs unbounded
 
- - _Bounded_ - it is finite and you know it, as in batch use cases
- - _Unbounded_ - it may be never end, you don't know, as in streaming use cases
+A `PCollection` can be either bounded or unbounded.
 
-These derive from the intuitions of batch and stream processing, but the two
-are unified in Beam and bounded and unbounded PCollections can coexist in the
-same pipeline. If your runner can only support bounded PCollections, you'll
-need to reject pipelines that contain unbounded PCollections. If your
-runner is only really targeting streams, there are adapters in our support code
-to convert everything to APIs targeting unbounded data.
+ - _Bounded_ - A bounded `PCollection` is a dataset of a known, fixed size
+   (alternatively, a dataset that is not growing over time). Bounded data can
+   be processed by batch pipelines.
+ - _Unbounded_ - An unbounded PCollection is a dataset that grows over time,
+   with elements processed as they arrive. Unbounded data must be processed by
+   streaming pipelines.

Review comment:
       ```suggestion
    - A _bounded_ `PCollection` is a dataset of a known, fixed size
      (alternatively, a dataset that is not growing over time). Bounded data can
      be processed by batch pipelines.
    - An _unbounded_ `PCollection` is a dataset that grows over time,
      with elements processed as they arrive. Unbounded data must be processed by
      streaming pipelines.
   ```
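
To make the bounded/unbounded distinction concrete (a plain-Python sketch, not Beam's API): a bounded source is a finite collection you can consume whole, while an unbounded source never terminates, so elements must be handled as they arrive. `islice` below is just a stand-in for the windows and triggers a real streaming runner would use.

```python
import itertools

def bounded_source():
    # Bounded: a finite dataset of known size, as in batch processing.
    return [1, 2, 3, 4, 5]

def unbounded_source():
    # Unbounded: grows forever; you can never read it to completion.
    n = 0
    while True:
        n += 1
        yield n

def process(element):
    return element * 10

# Batch style: consume the whole bounded collection at once.
batch_result = [process(e) for e in bounded_source()]

# Streaming style: take elements incrementally from the unbounded stream.
# (Real runners use windows/triggers; islice is only a stand-in here.)
stream_result = [process(e) for e in itertools.islice(unbounded_source(), 3)]
```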

##########
File path: website/www/site/content/en/documentation/basics.md
##########
@@ -40,102 +42,171 @@ transforms, there are some special features worth highlighting.
 
 ### Pipeline
 
-A pipeline in Beam is a graph of PTransforms operating on PCollections. A
-pipeline is constructed by a user in their SDK of choice, and makes its way to
-your runner either via the SDK directly or via the Runner API's
-RPC interfaces.
+A Beam pipeline is a directed acyclic graph of all the data and computations in
+your data processing task. This includes reading input data, transforming that
+data, and writing output data. A pipeline is constructed by a user in their SDK
+of choice. Then, the pipeline makes its way to the runner either through the SDK
+directly or through the Runner API's RPC interface. For example, this diagram
+shows a branching pipeline:
+
+![The pipeline applies two transforms to a single input collection. Each
+  transform produces an output collection.](/images/design-your-pipeline-multiple-pcollections.svg)
+
+In the diagram, the boxes are parallel transformations called _PTransforms_ and
+the arrows with the circles represent the data (in the form of _PCollections_)
+that flows between the transforms. The data might be bounded, stored, data sets,
+or the data might also be unbounded streams of data. In Beam, most transforms
+apply equally to bounded and unbounded data.
+
+You can express almost any computation that you can think of as a graph as a
+Beam pipeline. A Beam driver program typically starts by creating a `Pipeline`
+object, and then uses that object as the basis for creating the pipeline’s data
+sets and its transforms.
+
+For more information about pipelines, see the following pages:
+
+ * [Beam Programming Guide: Overview](/documentation/programming-guide/#overview)
+ * [Beam Programming Guide: Creating a pipeline](/documentation/programming-guide/#creating-a-pipeline)
+ * [Design your pipeline](/documentation/pipelines/design-your-pipeline)
+ * [Create your pipeline](/documentation/pipeline/create-your-pipeline)
 
 ### PTransforms
 
-A `PTransform` represents a data processing operation, or a step,
-in your pipeline. A `PTransform` can be applied to one or more
-`PCollection` objects as input which performs some processing on the elements of that
-`PCollection` and produces zero or more output `PCollection` objects.
+A `PTransform` (or transform) represents a data processing operation, or a step,
+in your pipeline. A transform can be applied to one or more input `PCollection`
+objects. You provide processing logic in the form of a function object
+(colloquially referred to as “user code”), and your user code is applied to each
+element of the input PCollection (or more than one PCollection).  Depending on
+the pipeline runner and backend that you choose, many different workers across a
+cluster might execute instances of your user code in parallel.  The user code
+that runs on each worker generates the output elements that are added to the
+zero or more output `PCollection` objects.
+
+The Beam SDKs contain a number of different transforms that you can apply to
+your pipeline’s PCollections. These include general-purpose core transforms,
+such as `ParDo` or `Combine`. There are also pre-written composite transforms
+included in the SDKs, which combine one or more of the core transforms in a
+useful processing pattern, such as counting or combining elements in a
+collection. You can also define your own more complex composite transforms to
+fit your pipeline’s exact use case.
+
+The following list has some common transform types:
+
+ * Root transforms such as `TextIO.Read` and `Create`. A root transform

Review comment:
       I'm not sure we use the term root anywhere else. I might call these "source" or "input" transforms instead.

##########
File path: website/www/site/content/en/documentation/basics.md
##########
@@ -40,102 +42,171 @@ transforms, there are some special features worth highlighting.
 
 ### Pipeline
 
-A pipeline in Beam is a graph of PTransforms operating on PCollections. A
-pipeline is constructed by a user in their SDK of choice, and makes its way to
-your runner either via the SDK directly or via the Runner API's
-RPC interfaces.
+A Beam pipeline is a directed acyclic graph of all the data and computations in
+your data processing task. This includes reading input data, transforming that
+data, and writing output data. A pipeline is constructed by a user in their SDK
+of choice. Then, the pipeline makes its way to the runner either through the SDK
+directly or through the Runner API's RPC interface. For example, this diagram
+shows a branching pipeline:
+
+![The pipeline applies two transforms to a single input collection. Each
+  transform produces an output collection.](/images/design-your-pipeline-multiple-pcollections.svg)
+
+In the diagram, the boxes are parallel transformations called _PTransforms_ and
+the arrows with the circles represent the data (in the form of _PCollections_)
+that flows between the transforms. The data might be bounded, stored, data sets,
+or the data might also be unbounded streams of data. In Beam, most transforms
+apply equally to bounded and unbounded data.
+
+You can express almost any computation that you can think of as a graph as a
+Beam pipeline. A Beam driver program typically starts by creating a `Pipeline`
+object, and then uses that object as the basis for creating the pipeline’s data
+sets and its transforms.
+
+For more information about pipelines, see the following pages:
+
+ * [Beam Programming Guide: Overview](/documentation/programming-guide/#overview)
+ * [Beam Programming Guide: Creating a pipeline](/documentation/programming-guide/#creating-a-pipeline)
+ * [Design your pipeline](/documentation/pipelines/design-your-pipeline)
+ * [Create your pipeline](/documentation/pipeline/create-your-pipeline)
 
 ### PTransforms
 
-A `PTransform` represents a data processing operation, or a step,
-in your pipeline. A `PTransform` can be applied to one or more
-`PCollection` objects as input which performs some processing on the elements of that
-`PCollection` and produces zero or more output `PCollection` objects.
+A `PTransform` (or transform) represents a data processing operation, or a step,
+in your pipeline. A transform can be applied to one or more input `PCollection`
+objects. You provide processing logic in the form of a function object
+(colloquially referred to as “user code”), and your user code is applied to each
+element of the input PCollection (or more than one PCollection).  Depending on
+the pipeline runner and backend that you choose, many different workers across a
+cluster might execute instances of your user code in parallel.  The user code
+that runs on each worker generates the output elements that are added to the
+zero or more output `PCollection` objects.
+
+The Beam SDKs contain a number of different transforms that you can apply to
+your pipeline’s PCollections. These include general-purpose core transforms,
+such as `ParDo` or `Combine`. There are also pre-written composite transforms
+included in the SDKs, which combine one or more of the core transforms in a
+useful processing pattern, such as counting or combining elements in a
+collection. You can also define your own more complex composite transforms to
+fit your pipeline’s exact use case.
+
+The following list has some common transform types:
+
+ * Root transforms such as `TextIO.Read` and `Create`. A root transform
+   conceptually has no input.
+ * Processing and conversion operations such as `ParDo`, `GroupByKey`,
+   `CoGroupByKey`, `Combine`, and `Count`.
+ * Outputting transforms like `TextIO.Write`.
+ * User-defined, application-specific composite transforms.
+
+For more information about transforms, see the following pages:
+
+ * [Beam Programming Guide: Overview](/documentation/programming-guide/#overview)
+ * [Beam Programming Guide: Transforms](/documentation/programming-guide/#transforms)
+ * Beam transform catalog ([Java](/documentation/transforms/java/overview/),
+   [Python](/documentation/transforms/python/overview/))
 
 ### PCollections
 
-A PCollection is an unordered bag of elements. Your runner will be responsible

Review comment:
       nit: I kind of like the "unordered bag of elements" analogy, might be nice to keep it in somehow
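
The "unordered bag" analogy can be shown in a few lines of plain Python (a sketch of multiset semantics, not Beam code): two PCollections with the same elements are equivalent regardless of order, but duplicates still count.

```python
from collections import Counter

# An "unordered bag" (multiset): duplicates matter, order does not.
pcollection_a = ["cat", "dog", "cat", "bird"]
pcollection_b = ["bird", "cat", "dog", "cat"]  # same elements, different order

def same_bag(xs, ys):
    # Equal as bags: same elements with the same multiplicities.
    return Counter(xs) == Counter(ys)
```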

##########
File path: website/www/site/content/en/documentation/basics.md
##########
@@ -22,11 +22,13 @@ of operations. You want to integrate it with the Beam ecosystem to get access
 to other languages, great event time processing, and a library of connectors.
 You need to know the core vocabulary:
 
- * [_Pipeline_](#pipeline) - A pipeline is a graph of transformations that a user constructs
-   that defines the data processing they want to do.
- * [_PCollection_](#pcollections) - Data being processed in a pipeline is part of a PCollection.
- * [_PTransforms_](#ptransforms) - The operations executed within a pipeline. These are best
-   thought of as operations on PCollections.
+ * [_Pipeline_](#pipeline) - A pipeline is a user-constructed graph of
+   transformations that defines the desired data processing operations.
+ * [_PCollection_](#pcollections) - A `PCollection` is a data set or data
+   stream. The data that a pipeline processes is part of a PCollection.
+ * [_PTransforms_](#ptransforms) - A `PTransform` (or _transform_) represents a
+   data processing operation, or a step, in your pipeline. A transform can be
+   applied to one or more `PCollection` objects.

Review comment:
       There's some more nuance here, I'm not sure if we want to get into it or not, I'll leave that up to you.
   
   - Transforms also (usually) produce one or more `PCollection`s, but they can also produce zero collections (i.e. a transform that writes output).
   - A transform isn't always applied to one or more collections, a transform that reads input can have zero input collections.
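
The arity nuance above can be sketched as a toy model (hypothetical, not the Beam API): a transform maps zero-or-more input collections to zero-or-more output collections, so a read has zero inputs and a write has zero outputs.

```python
# Toy model of transform arity (not Beam code).

def read_source():
    # "Read" transform: zero input collections, one output collection.
    return [["line one", "line two"]]

def count_chars(collections):
    # "Processing" transform: one input collection, one output collection.
    (lines,) = collections
    return [[len(line) for line in lines]]

def write_sink(collections):
    # "Write" transform: one input collection, zero output collections.
    (elements,) = collections
    written = list(elements)  # stand-in for writing to external storage
    return [], written

outputs = read_source()
counts = count_chars(outputs)
remaining, written = write_sink(counts)
```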



