Posted to github@beam.apache.org by "VeronicaWasson (via GitHub)" <gi...@apache.org> on 2024/04/03 20:31:01 UTC

[PR] Proposed edits for Beam YAML overview [beam]

VeronicaWasson opened a new pull request, #30842:
URL: https://github.com/apache/beam/pull/30842

   Makes the following changes:
   
   - Organize into several main sections
   - Make the "Getting Started" section more procedural
   - Use a self-contained pipeline for Getting Started (no input data files required)
   - Add explanatory text to motivate the example YAMLs
   - General style edits
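   
   For illustration only (this sketch is not taken from the PR itself): a pipeline can be self-contained if it generates its own input rather than reading from files. The example below assumes the built-in `Create` and `LogForDebugging` transforms, and the file name `pipeline.yaml` is hypothetical.
   
   ```yaml
   # pipeline.yaml (hypothetical file name)
   # A linear pipeline: each transform feeds the next one in the list.
   pipeline:
     type: chain
     transforms:
       # Produce three in-memory elements, so no input data file is needed.
       - type: Create
         config:
           elements: [1, 2, 3]
       # Write each element to the log so the result is visible locally.
       - type: LogForDebugging
   ```
   
   Assuming the SDK is installed with the `yaml` extra, a file like this can be run locally with `python -m apache_beam.yaml.main --yaml_pipeline_file=pipeline.yaml`.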
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] Proposed edits for Beam YAML overview [beam]

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #30842:
URL: https://github.com/apache/beam/pull/30842#issuecomment-2035636272

   Assigning reviewers. If you would like to opt out of this review, comment `assign to next reviewer`:
   
   R: @rszper for label website.
   
   Available commands:
   - `stop reviewer notifications` - opt out of the automated review tooling
   - `remind me after tests pass` - tag the comment author after tests pass
   - `waiting on author` - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)
   
   The PR bot will only process comments in the main thread (not review comments).




Re: [PR] Proposed edits for Beam YAML overview [beam]

Posted by "robertwb (via GitHub)" <gi...@apache.org>.
robertwb commented on code in PR #30842:
URL: https://github.com/apache/beam/pull/30842#discussion_r1552524970


##########
website/www/site/content/en/documentation/sdks/yaml.md:
##########
@@ -23,80 +23,132 @@ title: "Apache Beam YAML API"
 
 # Beam YAML API
 
-While Beam provides powerful APIs for authoring sophisticated data
-processing pipelines, it often still has too high a barrier for
-getting started and authoring simple pipelines. Even setting up the
-environment, installing the dependencies, and setting up the project
-can be an overwhelming amount of boilerplate for some (though
-https://beam.apache.org/blog/beam-starter-projects/ has gone a long
-way in making this easier).
-
-Here we provide a simple declarative syntax for describing pipelines
-that does not require coding experience or learning how to use an
-SDK&mdash;any text editor will do.
-Some installation may be required to actually *execute* a pipeline, but
-we envision various services (such as Dataflow) to accept yaml pipelines
-directly obviating the need for even that in the future.
-We also anticipate the ability to generate code directly from these
-higher-level yaml descriptions, should one want to graduate to a full
-Beam SDK (and possibly the other direction as well as far as possible).
-
-Though we intend this syntax to be easily authored (and read) directly by
-humans, this may also prove a useful intermediate representation for
-tools to use as well, either as output (e.g. a pipeline authoring GUI)
-or consumption (e.g. a lineage analysis tool) and expect it to be more
-easily manipulated and semantically meaningful than the Beam protos
-themselves (which concern themselves more with execution).
-
-It should be noted that everything here is still under development, but any
-features already included are considered stable. Feedback is welcome at
-dev@apache.beam.org.
-
-## Running pipelines
-
-The Beam yaml parser is currently included as part of the Apache Beam Python SDK.
-This can be installed (e.g. within a virtual environment) as
+Beam YAML is a declarative syntax for describing Apache Beam pipelines by using
+YAML files. You can use Beam YAML to author and run a Beam pipeline without
+writing any code.
+
+## Overview
+
+Beam provides a powerful model for creating sophisticated data processing
+pipelines. However, getting started with Beam programming can be challenging
+because it requires writing code in one of the supported Beam SDK languages.
+You need to understand the APIs, set up a project, manage dependencies, and
+perform other programming tasks.
+
+Beam YAML makes it easier to get started with creating Beam pipelines. Instead
+of writing code, you create a YAML file using any text editor. Then you submit
+the YAML file to be executed by a runner.
+
+The Beam YAML syntax is designed to be human-readable but also suitable as an
+intermediate representation for tools. For example, a pipeline authoring GUI
+could output YAML, or a lineage analysis tool could consume the YAML pipeline
+specifications.
+
+Beam YAML is still under development, but any features already included are
+considered stable. Feedback is welcome at dev@apache.beam.org.
+
+## Prerequisites
+
+The Beam YAML parser is currently included as part of the
+[Apache Beam Python SDK](../python/). You don't need to write Python code to use
+Beam YAML, but you need the SDK to run pipelines locally.
+
+We recommend creating a
+[virtual environment](../../../get-started/quickstart/python/#create-and-activate-a-virtual-environment)
+so that all packages are installed in an isolated and self-contained
+environment. After you set up your Python environment, install the SDK as
+follows:
 
 ```
 pip install apache_beam[yaml,gcp]
 ```
 
-In addition, several of the provided transforms (such as SQL) are implemented
-in Java and their expansion will require a working Java interpeter. (The
-requisite artifacts will be automatically downloaded from the apache maven
-repositories, so no further installs will be required.)
-Docker is also currently required for local execution of these
-cross-language-requiring transforms, but not for submission to a non-local
-runner such as Flink or Dataflow.
+In addition, several of the provided transforms, such as the SQL transform, are
+implemented in Java and require a working Java interpreter. When you run a
+pipeline with these transforms, the required artifacts are automatically
+downloaded from the Apache Maven repositories. To execute these cross-language
+transforms locally, you must have Docker installed on your local machine.

Review Comment:
   The part about Docker is no longer true since https://github.com/apache/beam/pull/29283 . (Are there other places we should be updating this as well?)



##########
website/www/site/content/en/documentation/sdks/yaml.md:
##########
@@ -23,80 +23,132 @@ title: "Apache Beam YAML API"
 
 # Beam YAML API
 
-While Beam provides powerful APIs for authoring sophisticated data
-processing pipelines, it often still has too high a barrier for
-getting started and authoring simple pipelines. Even setting up the
-environment, installing the dependencies, and setting up the project
-can be an overwhelming amount of boilerplate for some (though
-https://beam.apache.org/blog/beam-starter-projects/ has gone a long
-way in making this easier).
-
-Here we provide a simple declarative syntax for describing pipelines
-that does not require coding experience or learning how to use an
-SDK&mdash;any text editor will do.
-Some installation may be required to actually *execute* a pipeline, but
-we envision various services (such as Dataflow) to accept yaml pipelines
-directly obviating the need for even that in the future.
-We also anticipate the ability to generate code directly from these
-higher-level yaml descriptions, should one want to graduate to a full
-Beam SDK (and possibly the other direction as well as far as possible).
-
-Though we intend this syntax to be easily authored (and read) directly by
-humans, this may also prove a useful intermediate representation for
-tools to use as well, either as output (e.g. a pipeline authoring GUI)
-or consumption (e.g. a lineage analysis tool) and expect it to be more
-easily manipulated and semantically meaningful than the Beam protos
-themselves (which concern themselves more with execution).
-
-It should be noted that everything here is still under development, but any
-features already included are considered stable. Feedback is welcome at
-dev@apache.beam.org.
-
-## Running pipelines
-
-The Beam yaml parser is currently included as part of the Apache Beam Python SDK.
-This can be installed (e.g. within a virtual environment) as
+Beam YAML is a declarative syntax for describing Apache Beam pipelines by using
+YAML files. You can use Beam YAML to author and run a Beam pipeline without
+writing any code.
+
+## Overview
+
+Beam provides a powerful model for creating sophisticated data processing
+pipelines. However, getting started with Beam programming can be challenging
+because it requires writing code in one of the supported Beam SDK languages.
+You need to understand the APIs, set up a project, manage dependencies, and
+perform other programming tasks.
+
+Beam YAML makes it easier to get started with creating Beam pipelines. Instead

Review Comment:
   +1, that's a good idea. 





Re: [PR] Proposed edits for Beam YAML overview [beam]

Posted by "rszper (via GitHub)" <gi...@apache.org>.
rszper commented on code in PR #30842:
URL: https://github.com/apache/beam/pull/30842#discussion_r1550546970


##########
website/www/site/content/en/documentation/sdks/yaml.md:
##########
@@ -23,80 +23,132 @@ title: "Apache Beam YAML API"
 
 # Beam YAML API
 
-While Beam provides powerful APIs for authoring sophisticated data
-processing pipelines, it often still has too high a barrier for
-getting started and authoring simple pipelines. Even setting up the
-environment, installing the dependencies, and setting up the project
-can be an overwhelming amount of boilerplate for some (though
-https://beam.apache.org/blog/beam-starter-projects/ has gone a long
-way in making this easier).
-
-Here we provide a simple declarative syntax for describing pipelines
-that does not require coding experience or learning how to use an
-SDK&mdash;any text editor will do.
-Some installation may be required to actually *execute* a pipeline, but
-we envision various services (such as Dataflow) to accept yaml pipelines
-directly obviating the need for even that in the future.
-We also anticipate the ability to generate code directly from these
-higher-level yaml descriptions, should one want to graduate to a full
-Beam SDK (and possibly the other direction as well as far as possible).
-
-Though we intend this syntax to be easily authored (and read) directly by
-humans, this may also prove a useful intermediate representation for
-tools to use as well, either as output (e.g. a pipeline authoring GUI)
-or consumption (e.g. a lineage analysis tool) and expect it to be more
-easily manipulated and semantically meaningful than the Beam protos
-themselves (which concern themselves more with execution).
-
-It should be noted that everything here is still under development, but any
-features already included are considered stable. Feedback is welcome at
-dev@apache.beam.org.
-
-## Running pipelines
-
-The Beam yaml parser is currently included as part of the Apache Beam Python SDK.
-This can be installed (e.g. within a virtual environment) as
+Beam YAML is a declarative syntax for describing Apache Beam pipelines by using
+YAML files. You can use Beam YAML to author and run a Beam pipeline without
+writing any code.
+
+## Overview
+
+Beam provides a powerful model for creating sophisticated data processing
+pipelines. However, getting started with Beam programming can be challenging
+because it requires writing code in one of the supported Beam SDK languages.
+You need to understand the APIs, set up a project, manage dependencies, and
+perform other programming tasks.
+
+Beam YAML makes it easier to get started with creating Beam pipelines. Instead

Review Comment:
   If possible, I would find a way to make this the first paragraph in the overview so that we start by listing the benefits of Beam YAML instead of the challenges of normal Beam. That might require some rewriting, though, so just a suggestion. Feel free to ignore.





Re: [PR] Proposed edits for Beam YAML overview [beam]

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #30842:
URL: https://github.com/apache/beam/pull/30842#issuecomment-2035683097

   R: @melap for final approval




Re: [PR] Proposed edits for Beam YAML overview [beam]

Posted by "robertwb (via GitHub)" <gi...@apache.org>.
robertwb merged PR #30842:
URL: https://github.com/apache/beam/pull/30842

