You are viewing a plain text version of this content. The canonical link for it is here.

Posted to github@beam.apache.org by "tvalentyn (via GitHub)" <gi...@apache.org> on 2023/07/28 23:31:55 UTC

[GitHub] [beam] tvalentyn opened a new pull request, #27749: Add the guidance on controlling pipeline dependencies.

tvalentyn opened a new pull request, #27749:
URL: https://github.com/apache/beam/pull/27749

   Adds the guidance on controlling pipeline dependencies.
   
   
   ------------------------
   
   Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:
   
    - [ ] Mention the appropriate issue in your description (for example: `addresses #123`), if applicable. This will automatically add a link to the pull request in the issue. If you would like the issue to automatically close on merging the pull request, comment `fixes #<ISSUE NUMBER>` instead.
    - [ ] Update `CHANGES.md` with noteworthy changes.
    - [ ] If this contribution is large, please file an Apache [Individual Contributor License Agreement](https://www.apache.org/licenses/icla.pdf).
   
   See the [Contributor Guide](https://beam.apache.org/contribute) for more tips on [how to make review process smoother](https://beam.apache.org/contribute/get-started-contributing/#make-the-reviewers-job-easier).
   
   To check the build health, please visit [https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md](https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md)
   
   GitHub Actions Tests Status (on master branch)
   ------------------------------------------------------------------------------------------------
   [![Build python source distribution and wheels](https://github.com/apache/beam/workflows/Build%20python%20source%20distribution%20and%20wheels/badge.svg?branch=master&event=schedule)](https://github.com/apache/beam/actions?query=workflow%3A%22Build+python+source+distribution+and+wheels%22+branch%3Amaster+event%3Aschedule)
   [![Python tests](https://github.com/apache/beam/workflows/Python%20tests/badge.svg?branch=master&event=schedule)](https://github.com/apache/beam/actions?query=workflow%3A%22Python+Tests%22+branch%3Amaster+event%3Aschedule)
   [![Java tests](https://github.com/apache/beam/workflows/Java%20Tests/badge.svg?branch=master&event=schedule)](https://github.com/apache/beam/actions?query=workflow%3A%22Java+Tests%22+branch%3Amaster+event%3Aschedule)
   [![Go tests](https://github.com/apache/beam/workflows/Go%20tests/badge.svg?branch=master&event=schedule)](https://github.com/apache/beam/actions?query=workflow%3A%22Go+tests%22+branch%3Amaster+event%3Aschedule)
   
   See [CI.md](https://github.com/apache/beam/blob/master/CI.md) for more information about GitHub Actions CI or the [workflows README](https://github.com/apache/beam/blob/master/.github/workflows/README.md) to see a list of phrases to trigger workflows.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [beam] tvalentyn merged pull request #27749: Add the guidance on controlling pipeline dependencies.

Posted by "tvalentyn (via GitHub)" <gi...@apache.org>.

tvalentyn merged PR #27749:
URL: https://github.com/apache/beam/pull/27749


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [beam] github-actions[bot] commented on pull request #27749: Add the guidance on controlling pipeline dependencies.

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.

github-actions[bot] commented on PR #27749:
URL: https://github.com/apache/beam/pull/27749#issuecomment-1656453213

   Stopping reviewer notifications for this pull request: review requested by someone other than the bot, ceding control


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [beam] rszper commented on a diff in pull request #27749: Add the guidance on controlling pipeline dependencies.

Posted by "rszper (via GitHub)" <gi...@apache.org>.

rszper commented on code in PR #27749:
URL: https://github.com/apache/beam/pull/27749#discussion_r1278181917


##########
website/www/site/content/en/documentation/sdks/python-pipeline-dependencies.md:
##########
@@ -160,3 +159,80 @@ Since serialization of the pipeline happens on the job submission, and deseriali
 To ensure this, Beam typically sets a very narrow supported version range for pickling libraries. If for whatever reason, users cannot use the version of `dill` or `cloudpickle` required by Beam, and choose to
 install a custom version, they must also ensure that they use the same custom version at runtime (e.g. in their custom container,
 or by specifying a pipeline dependency requirement).
+
+## Control the dependencies the pipeline uses {#control-dependencies}
+
+### Pipeline environments
+
+To run a Python pipeline on a remote runner, Apache Beam translates the pipeline into a [runner-independent representation](https://github.com/apache/beam/blob/master/model/pipeline/src/main/proto/org/apache/beam/model/pipeline/v1/beam_runner_api.proto) and submits it for execution. Translation happens in the **launch environment**. You can launch the pipeline from a Python virtual environment with installed Beam SDK, or with tools like [Dataflow Flex Templates](https://cloud.google.com/dataflow/docs/guides/templates/using-flex-templates), [Notebook environments](https://cloud.google.com/dataflow/docs/guides/interactive-pipeline-development), [Apache Airflow](https://airflow.apache.org/) and more.

Review Comment:
   ```suggestion
   To run a Python pipeline on a remote runner, Apache Beam translates the pipeline into a [runner-independent representation](https://github.com/apache/beam/blob/master/model/pipeline/src/main/proto/org/apache/beam/model/pipeline/v1/beam_runner_api.proto) and submits it for execution. Translation happens in the **launch environment**. You can launch the pipeline from a Python virtual environment with the installed Beam SDK, or with tools like [Dataflow Flex Templates](https://cloud.google.com/dataflow/docs/guides/templates/using-flex-templates), [Notebook environments](https://cloud.google.com/dataflow/docs/guides/interactive-pipeline-development), [Apache Airflow](https://airflow.apache.org/), and more.
   ```



##########
website/www/site/content/en/documentation/sdks/python-pipeline-dependencies.md:
##########
@@ -160,3 +159,80 @@ Since serialization of the pipeline happens on the job submission, and deseriali
 To ensure this, Beam typically sets a very narrow supported version range for pickling libraries. If for whatever reason, users cannot use the version of `dill` or `cloudpickle` required by Beam, and choose to
 install a custom version, they must also ensure that they use the same custom version at runtime (e.g. in their custom container,
 or by specifying a pipeline dependency requirement).
+
+## Control the dependencies the pipeline uses {#control-dependencies}
+
+### Pipeline environments
+
+To run a Python pipeline on a remote runner, Apache Beam translates the pipeline into a [runner-independent representation](https://github.com/apache/beam/blob/master/model/pipeline/src/main/proto/org/apache/beam/model/pipeline/v1/beam_runner_api.proto) and submits it for execution. Translation happens in the **launch environment**. You can launch the pipeline from a Python virtual environment with installed Beam SDK, or with tools like [Dataflow Flex Templates](https://cloud.google.com/dataflow/docs/guides/templates/using-flex-templates), [Notebook environments](https://cloud.google.com/dataflow/docs/guides/interactive-pipeline-development), [Apache Airflow](https://airflow.apache.org/) and more.
+
+The [**runtime environment**](https://beam.apache.org/documentation/runtime/environments/) is the Python environment that a runner uses during pipeline execution. This environment is where the pipeline code runs to perform data  processing. The runtime environment includes Apache Beam and pipeline runtime dependencies.
+
+### Create reproducible environments {#create-reproducible-environments}
+
+You can use several tools to build reproducible Python environments:
+
+* **Use [requirements files](https://pip.pypa.io/en/stable/user_guide/#requirements-files).**  After you install dependencies, generate the requirements file by using `pip freeze > requirements.txt`. To recreate an environment, install dependencies from the requirements.txt file by using `pip install -r requirements.txt`.
+
+* **Use [constraint files](https://pip.pypa.io/en/stable/user_guide/#constraints-files).** You can use the constraint list to  restrict the installation of packages, allowing only  specified versions.
+
+* **Use lock files.** Use dependency management tools like [PipEnv](https://pipenv.pypa.io/en/latest/), [Poetry](https://python-poetry.org/), and [pip-tools](https://github.com/jazzband/pip-tools) to specify top-level dependencies, to generate lock files of all transitive dependencies with pinned versions, and to create virtual environments from these lockfiles.
+
+* **Use Docker container images.** You can package the launch and runtime environment inside a Docker container image. If the image includes all necessary dependencies, then the environment only changes when a container image is rebuilt.
+
+Use version control for the configuration files that define the environment.
+
+### Make the pipeline runtime environment reproducible
+
+When a pipeline uses a reproducible runtime environment on a remote runner, the workers on the runner use  the same dependencies each time the pipeline runs. A reproducible environment is immune to side-effects caused by releases of the pipeline's direct or transitive dependencies. It doesn’t require dependency resolution at runtime.

Review Comment:
   ```suggestion
   When a pipeline uses a reproducible runtime environment on a remote runner, the workers on the runner use the same dependencies each time the pipeline runs. A reproducible environment is immune to side-effects caused by releases of the pipeline's direct or transitive dependencies. It doesn’t require dependency resolution at runtime.
   ```



##########
website/www/site/content/en/documentation/sdks/python-pipeline-dependencies.md:
##########
@@ -160,3 +159,80 @@ Since serialization of the pipeline happens on the job submission, and deseriali
 To ensure this, Beam typically sets a very narrow supported version range for pickling libraries. If for whatever reason, users cannot use the version of `dill` or `cloudpickle` required by Beam, and choose to
 install a custom version, they must also ensure that they use the same custom version at runtime (e.g. in their custom container,
 or by specifying a pipeline dependency requirement).
+
+## Control the dependencies the pipeline uses {#control-dependencies}
+
+### Pipeline environments
+
+To run a Python pipeline on a remote runner, Apache Beam translates the pipeline into a [runner-independent representation](https://github.com/apache/beam/blob/master/model/pipeline/src/main/proto/org/apache/beam/model/pipeline/v1/beam_runner_api.proto) and submits it for execution. Translation happens in the **launch environment**. You can launch the pipeline from a Python virtual environment with installed Beam SDK, or with tools like [Dataflow Flex Templates](https://cloud.google.com/dataflow/docs/guides/templates/using-flex-templates), [Notebook environments](https://cloud.google.com/dataflow/docs/guides/interactive-pipeline-development), [Apache Airflow](https://airflow.apache.org/) and more.
+
+The [**runtime environment**](https://beam.apache.org/documentation/runtime/environments/) is the Python environment that a runner uses during pipeline execution. This environment is where the pipeline code runs to perform data  processing. The runtime environment includes Apache Beam and pipeline runtime dependencies.
+
+### Create reproducible environments {#create-reproducible-environments}
+
+You can use several tools to build reproducible Python environments:
+
+* **Use [requirements files](https://pip.pypa.io/en/stable/user_guide/#requirements-files).**  After you install dependencies, generate the requirements file by using `pip freeze > requirements.txt`. To recreate an environment, install dependencies from the requirements.txt file by using `pip install -r requirements.txt`.
+
+* **Use [constraint files](https://pip.pypa.io/en/stable/user_guide/#constraints-files).** You can use the constraint list to  restrict the installation of packages, allowing only  specified versions.
+
+* **Use lock files.** Use dependency management tools like [PipEnv](https://pipenv.pypa.io/en/latest/), [Poetry](https://python-poetry.org/), and [pip-tools](https://github.com/jazzband/pip-tools) to specify top-level dependencies, to generate lock files of all transitive dependencies with pinned versions, and to create virtual environments from these lockfiles.
+
+* **Use Docker container images.** You can package the launch and runtime environment inside a Docker container image. If the image includes all necessary dependencies, then the environment only changes when a container image is rebuilt.
+
+Use version control for the configuration files that define the environment.
+
+### Make the pipeline runtime environment reproducible
+
+When a pipeline uses a reproducible runtime environment on a remote runner, the workers on the runner use  the same dependencies each time the pipeline runs. A reproducible environment is immune to side-effects caused by releases of the pipeline's direct or transitive dependencies. It doesn’t require dependency resolution at runtime.
+
+You can create a reproducible runtime environment in the following ways:
+
+* Run your pipeline in a custom container image that has all dependencies for your pipeline. Use the `--sdk_container_image` pipeline option.
+
+* Supply an exhaustive list of the pipeline's dependencies in the `--requirements_file` pipeline option. Use the `--prebuild_sdk_container_engine` option to perform the runtime environment initialization sequence before the pipeline execution. If your dependencies don't change, reuse the prebuilt image by using the `--sdk_container_image` option.
+
+A self-contained runtime environment is usually reproducible. To check if the  runtime environment is self-contained, restrict internet access to PyPI in the pipeline runtime. If you use the Dataflow Runner, see the documentation for the [`--no_use_public_ips`](https://cloud.google.com/dataflow/docs/guides/routes-firewall#turn_off_external_ip_address) pipeline option.
+
+If you need to recreate or upgrade the runtime environment, do so in a controlled way with visibility into changed dependencies:
+
+* Do not modify container images when running pipelines are still using them.
+
+* Avoid using the tag `:latest`  with your custom images. Tag your builds with a date or a unique identifier.  If something goes wrong, using this type of tag might make it possible to revert the pipeline execution to a previously known working configuration  and allow for an inspection of changes.

Review Comment:
   ```suggestion
   * Avoid using the tag `:latest` with your custom images. Tag your builds with a date or a unique identifier. If something goes wrong, using this type of tag might make it possible to revert the pipeline execution to a previously known working configuration and allow for an inspection of changes.
   ```



##########
website/www/site/content/en/documentation/sdks/python-pipeline-dependencies.md:
##########
@@ -160,3 +159,80 @@ Since serialization of the pipeline happens on the job submission, and deseriali
 To ensure this, Beam typically sets a very narrow supported version range for pickling libraries. If for whatever reason, users cannot use the version of `dill` or `cloudpickle` required by Beam, and choose to
 install a custom version, they must also ensure that they use the same custom version at runtime (e.g. in their custom container,
 or by specifying a pipeline dependency requirement).
+
+## Control the dependencies the pipeline uses {#control-dependencies}
+
+### Pipeline environments
+
+To run a Python pipeline on a remote runner, Apache Beam translates the pipeline into a [runner-independent representation](https://github.com/apache/beam/blob/master/model/pipeline/src/main/proto/org/apache/beam/model/pipeline/v1/beam_runner_api.proto) and submits it for execution. Translation happens in the **launch environment**. You can launch the pipeline from a Python virtual environment with installed Beam SDK, or with tools like [Dataflow Flex Templates](https://cloud.google.com/dataflow/docs/guides/templates/using-flex-templates), [Notebook environments](https://cloud.google.com/dataflow/docs/guides/interactive-pipeline-development), [Apache Airflow](https://airflow.apache.org/) and more.
+
+The [**runtime environment**](https://beam.apache.org/documentation/runtime/environments/) is the Python environment that a runner uses during pipeline execution. This environment is where the pipeline code runs to perform data  processing. The runtime environment includes Apache Beam and pipeline runtime dependencies.
+
+### Create reproducible environments {#create-reproducible-environments}
+
+You can use several tools to build reproducible Python environments:
+
+* **Use [requirements files](https://pip.pypa.io/en/stable/user_guide/#requirements-files).**  After you install dependencies, generate the requirements file by using `pip freeze > requirements.txt`. To recreate an environment, install dependencies from the requirements.txt file by using `pip install -r requirements.txt`.
+
+* **Use [constraint files](https://pip.pypa.io/en/stable/user_guide/#constraints-files).** You can use the constraint list to  restrict the installation of packages, allowing only  specified versions.
+
+* **Use lock files.** Use dependency management tools like [PipEnv](https://pipenv.pypa.io/en/latest/), [Poetry](https://python-poetry.org/), and [pip-tools](https://github.com/jazzband/pip-tools) to specify top-level dependencies, to generate lock files of all transitive dependencies with pinned versions, and to create virtual environments from these lockfiles.
+
+* **Use Docker container images.** You can package the launch and runtime environment inside a Docker container image. If the image includes all necessary dependencies, then the environment only changes when a container image is rebuilt.
+
+Use version control for the configuration files that define the environment.
+
+### Make the pipeline runtime environment reproducible
+
+When a pipeline uses a reproducible runtime environment on a remote runner, the workers on the runner use  the same dependencies each time the pipeline runs. A reproducible environment is immune to side-effects caused by releases of the pipeline's direct or transitive dependencies. It doesn’t require dependency resolution at runtime.
+
+You can create a reproducible runtime environment in the following ways:
+
+* Run your pipeline in a custom container image that has all dependencies for your pipeline. Use the `--sdk_container_image` pipeline option.
+
+* Supply an exhaustive list of the pipeline's dependencies in the `--requirements_file` pipeline option. Use the `--prebuild_sdk_container_engine` option to perform the runtime environment initialization sequence before the pipeline execution. If your dependencies don't change, reuse the prebuilt image by using the `--sdk_container_image` option.
+
+A self-contained runtime environment is usually reproducible. To check if the  runtime environment is self-contained, restrict internet access to PyPI in the pipeline runtime. If you use the Dataflow Runner, see the documentation for the [`--no_use_public_ips`](https://cloud.google.com/dataflow/docs/guides/routes-firewall#turn_off_external_ip_address) pipeline option.
+
+If you need to recreate or upgrade the runtime environment, do so in a controlled way with visibility into changed dependencies:
+
+* Do not modify container images when running pipelines are still using them.
+
+* Avoid using the tag `:latest`  with your custom images. Tag your builds with a date or a unique identifier.  If something goes wrong, using this type of tag might make it possible to revert the pipeline execution to a previously known working configuration  and allow for an inspection of changes.
+
+* Consider storing the output of `pip freeze` or the contents of `requirements.txt` in the version control system.
+
+### Make the pipeline launch environment reproducible
+
+The launch environment runs the **production version** of the pipeline. While developing the pipeline locally, you might use a **development environment** that includes dependencies for development, such as Jupyter or Pylint. The launch environment for production pipelines might not need these additional dependencies. You can construct and maintain it separately from the dev environment.
+
+To reduce side-effects on pipeline submissions, it is best to able to [recreate launch environment in a reproducible manner](#create-reproducible-environments).

Review Comment:
   ```suggestion
   To reduce side-effects on pipeline submissions, it is best to able to [recreate the launch environment in a reproducible manner](#create-reproducible-environments).
   ```



##########
website/www/site/content/en/documentation/sdks/python-pipeline-dependencies.md:
##########
@@ -160,3 +159,80 @@ Since serialization of the pipeline happens on the job submission, and deseriali
 To ensure this, Beam typically sets a very narrow supported version range for pickling libraries. If for whatever reason, users cannot use the version of `dill` or `cloudpickle` required by Beam, and choose to
 install a custom version, they must also ensure that they use the same custom version at runtime (e.g. in their custom container,
 or by specifying a pipeline dependency requirement).
+
+## Control the dependencies the pipeline uses {#control-dependencies}
+
+### Pipeline environments
+
+To run a Python pipeline on a remote runner, Apache Beam translates the pipeline into a [runner-independent representation](https://github.com/apache/beam/blob/master/model/pipeline/src/main/proto/org/apache/beam/model/pipeline/v1/beam_runner_api.proto) and submits it for execution. Translation happens in the **launch environment**. You can launch the pipeline from a Python virtual environment with installed Beam SDK, or with tools like [Dataflow Flex Templates](https://cloud.google.com/dataflow/docs/guides/templates/using-flex-templates), [Notebook environments](https://cloud.google.com/dataflow/docs/guides/interactive-pipeline-development), [Apache Airflow](https://airflow.apache.org/) and more.
+
+The [**runtime environment**](https://beam.apache.org/documentation/runtime/environments/) is the Python environment that a runner uses during pipeline execution. This environment is where the pipeline code runs to perform data  processing. The runtime environment includes Apache Beam and pipeline runtime dependencies.
+
+### Create reproducible environments {#create-reproducible-environments}
+
+You can use several tools to build reproducible Python environments:
+
+* **Use [requirements files](https://pip.pypa.io/en/stable/user_guide/#requirements-files).**  After you install dependencies, generate the requirements file by using `pip freeze > requirements.txt`. To recreate an environment, install dependencies from the requirements.txt file by using `pip install -r requirements.txt`.
+
+* **Use [constraint files](https://pip.pypa.io/en/stable/user_guide/#constraints-files).** You can use the constraint list to  restrict the installation of packages, allowing only  specified versions.
+
+* **Use lock files.** Use dependency management tools like [PipEnv](https://pipenv.pypa.io/en/latest/), [Poetry](https://python-poetry.org/), and [pip-tools](https://github.com/jazzband/pip-tools) to specify top-level dependencies, to generate lock files of all transitive dependencies with pinned versions, and to create virtual environments from these lockfiles.
+
+* **Use Docker container images.** You can package the launch and runtime environment inside a Docker container image. If the image includes all necessary dependencies, then the environment only changes when a container image is rebuilt.
+
+Use version control for the configuration files that define the environment.
+
+### Make the pipeline runtime environment reproducible
+
+When a pipeline uses a reproducible runtime environment on a remote runner, the workers on the runner use  the same dependencies each time the pipeline runs. A reproducible environment is immune to side-effects caused by releases of the pipeline's direct or transitive dependencies. It doesn’t require dependency resolution at runtime.

Review Comment:
   ```suggestion
   When a pipeline uses a reproducible runtime environment on a remote runner, the workers on the runner use the same dependencies each time the pipeline runs. A reproducible environment is immune to side-effects caused by releases of the pipeline's direct or transitive dependencies. It doesn’t require dependency resolution at runtime.
   ```



##########
website/www/site/content/en/documentation/sdks/python-pipeline-dependencies.md:
##########
@@ -160,3 +159,80 @@ Since serialization of the pipeline happens on the job submission, and deseriali
 To ensure this, Beam typically sets a very narrow supported version range for pickling libraries. If for whatever reason, users cannot use the version of `dill` or `cloudpickle` required by Beam, and choose to
 install a custom version, they must also ensure that they use the same custom version at runtime (e.g. in their custom container,
 or by specifying a pipeline dependency requirement).
+
+## Control the dependencies the pipeline uses {#control-dependencies}
+
+### Pipeline environments
+
+To run a Python pipeline on a remote runner, Apache Beam translates the pipeline into a [runner-independent representation](https://github.com/apache/beam/blob/master/model/pipeline/src/main/proto/org/apache/beam/model/pipeline/v1/beam_runner_api.proto) and submits it for execution. Translation happens in the **launch environment**. You can launch the pipeline from a Python virtual environment with installed Beam SDK, or with tools like [Dataflow Flex Templates](https://cloud.google.com/dataflow/docs/guides/templates/using-flex-templates), [Notebook environments](https://cloud.google.com/dataflow/docs/guides/interactive-pipeline-development), [Apache Airflow](https://airflow.apache.org/) and more.
+
+The [**runtime environment**](https://beam.apache.org/documentation/runtime/environments/) is the Python environment that a runner uses during pipeline execution. This environment is where the pipeline code runs to perform data  processing. The runtime environment includes Apache Beam and pipeline runtime dependencies.
+
+### Create reproducible environments {#create-reproducible-environments}
+
+You can use several tools to build reproducible Python environments:
+
+* **Use [requirements files](https://pip.pypa.io/en/stable/user_guide/#requirements-files).**  After you install dependencies, generate the requirements file by using `pip freeze > requirements.txt`. To recreate an environment, install dependencies from the requirements.txt file by using `pip install -r requirements.txt`.
+
+* **Use [constraint files](https://pip.pypa.io/en/stable/user_guide/#constraints-files).** You can use the constraint list to  restrict the installation of packages, allowing only  specified versions.
+
+* **Use lock files.** Use dependency management tools like [PipEnv](https://pipenv.pypa.io/en/latest/), [Poetry](https://python-poetry.org/), and [pip-tools](https://github.com/jazzband/pip-tools) to specify top-level dependencies, to generate lock files of all transitive dependencies with pinned versions, and to create virtual environments from these lockfiles.
+
+* **Use Docker container images.** You can package the launch and runtime environment inside a Docker container image. If the image includes all necessary dependencies, then the environment only changes when a container image is rebuilt.
+
+Use version control for the configuration files that define the environment.
+
+### Make the pipeline runtime environment reproducible
+
+When a pipeline uses a reproducible runtime environment on a remote runner, the workers on the runner use  the same dependencies each time the pipeline runs. A reproducible environment is immune to side-effects caused by releases of the pipeline's direct or transitive dependencies. It doesn’t require dependency resolution at runtime.
+
+You can create a reproducible runtime environment in the following ways:
+
+* Run your pipeline in a custom container image that has all dependencies for your pipeline. Use the `--sdk_container_image` pipeline option.
+
+* Supply an exhaustive list of the pipeline's dependencies in the `--requirements_file` pipeline option. Use the `--prebuild_sdk_container_engine` option to perform the runtime environment initialization sequence before the pipeline execution. If your dependencies don't change, reuse the prebuilt image by using the `--sdk_container_image` option.
+
+A self-contained runtime environment is usually reproducible. To check if the  runtime environment is self-contained, restrict internet access to PyPI in the pipeline runtime. If you use the Dataflow Runner, see the documentation for the [`--no_use_public_ips`](https://cloud.google.com/dataflow/docs/guides/routes-firewall#turn_off_external_ip_address) pipeline option.
+
+If you need to recreate or upgrade the runtime environment, do so in a controlled way with visibility into changed dependencies:
+
+* Do not modify container images when running pipelines are still using them.
+
+* Avoid using the tag `:latest`  with your custom images. Tag your builds with a date or a unique identifier.  If something goes wrong, using this type of tag might make it possible to revert the pipeline execution to a previously known working configuration  and allow for an inspection of changes.
+
+* Consider storing the output of `pip freeze` or the contents of `requirements.txt` in the version control system.
+
+### Make the pipeline launch environment reproducible
+
+The launch environment runs the **production version** of the pipeline. While developing the pipeline locally, you might use a **development environment** that includes dependencies for development, such as Jupyter or Pylint. The launch environment for production pipelines might not need these additional dependencies. You can construct and maintain it separately from the dev environment.
+
+To reduce side-effects on pipeline submissions, it is best to able to [recreate launch environment in a reproducible manner](#create-reproducible-environments).
+
+[Dataflow Flex Templates](https://cloud.google.com/dataflow/docs/guides/templates/using-flex-templates) provide an example of a containerized, reproducible launch environment.
+
+To create reproducible installations of Beam into a clean virtual environment, use [requirements files](https://pip.pypa.io/en/stable/user_guide/#requirements-files) that list all Python dependencies included in Beam's default container images constraint files:
+
+```
+BEAM_VERSION=2.48.0
+PYTHON_VERSION=`python -c "import sys; print(f'{sys.version_info.major}{sys.version_info.minor}')"`
+pip install apache-beam==$BEAM_VERSION --constraint https://raw.githubusercontent.com/apache/beam/release-${BEAM_VERSION}/sdks/python/container/py${PY_VERSION}/base_image_requirements.txt
+```
+
+Use a constraint file to ensure that Beam dependencies in the launch environment match the versions in default Beam containers. A constraint file might also remove the need for dependency resolution at installation time.
+
+### Make the launch environment compatible with the runtime environment
+
+The launch environment translates the  pipeline graph into a [runner-independent representation](https://github.com/apache/beam/blob/master/model/pipeline/src/main/proto/org/apache/beam/model/pipeline/v1/beam_runner_api.proto). This process involves serializing (or pickling) the code of the transforms. The serialized content is deserialized on the workers. If the runtime worker environment significantly differs from the launch environment, runtime errors might occur for the following reasons:
+
+* Versions of `protobuf` in the submission and runtime environment need to match or be compatible.
+The Apache Beam version and the Python major.minor versions between submission and runtime environment must match. Otherwise, the pipeline might fail with errors like "Pipeline construction environment and pipeline runtime environment are not compatible." On older SDK versions, the error might be reported as "SystemError: unknown opcode".
+
+* Libraries used in the pipeline code might need to match. If serialized pipeline code has references to functions or modules that aren’t available on the workers, the pipeline might fail with ModuleNotFound or AttributeError exceptions on the remote runner. If you encounter such errors, make sure that the affected libraries are available on the remote worker, and check whether you need to [save the main session](  https://beam.apache.org/documentation/sdks/python-pipeline-dependencies/#pickling-and-managing-the-main-session).

Review Comment:
   ```suggestion
   * Libraries used in the pipeline code might need to match. If serialized pipeline code has references to functions or modules that aren’t available on the workers, the pipeline might fail with `ModuleNotFound` or `AttributeError` exceptions on the remote runner. If you encounter such errors, make sure that the affected libraries are available on the remote worker, and check whether you need to [save the main session](  https://beam.apache.org/documentation/sdks/python-pipeline-dependencies/#pickling-and-managing-the-main-session).
   ```



##########
website/www/site/content/en/documentation/sdks/python-pipeline-dependencies.md:
##########
@@ -160,3 +159,80 @@ Since serialization of the pipeline happens on the job submission, and deseriali
 To ensure this, Beam typically sets a very narrow supported version range for pickling libraries. If for whatever reason, users cannot use the version of `dill` or `cloudpickle` required by Beam, and choose to
 install a custom version, they must also ensure that they use the same custom version at runtime (e.g. in their custom container,
 or by specifying a pipeline dependency requirement).
+
+## Control the dependencies the pipeline uses {#control-dependencies}
+
+### Pipeline environments
+
+To run a Python pipeline on a remote runner, Apache Beam translates the pipeline into a [runner-independent representation](https://github.com/apache/beam/blob/master/model/pipeline/src/main/proto/org/apache/beam/model/pipeline/v1/beam_runner_api.proto) and submits it for execution. Translation happens in the **launch environment**. You can launch the pipeline from a Python virtual environment with installed Beam SDK, or with tools like [Dataflow Flex Templates](https://cloud.google.com/dataflow/docs/guides/templates/using-flex-templates), [Notebook environments](https://cloud.google.com/dataflow/docs/guides/interactive-pipeline-development), [Apache Airflow](https://airflow.apache.org/) and more.
+
+The [**runtime environment**](https://beam.apache.org/documentation/runtime/environments/) is the Python environment that a runner uses during pipeline execution. This environment is where the pipeline code runs to perform data  processing. The runtime environment includes Apache Beam and pipeline runtime dependencies.
+
+### Create reproducible environments {#create-reproducible-environments}
+
+You can use several tools to build reproducible Python environments:
+
+* **Use [requirements files](https://pip.pypa.io/en/stable/user_guide/#requirements-files).**  After you install dependencies, generate the requirements file by using `pip freeze > requirements.txt`. To recreate an environment, install dependencies from the requirements.txt file by using `pip install -r requirements.txt`.
+
+* **Use [constraint files](https://pip.pypa.io/en/stable/user_guide/#constraints-files).** You can use the constraint list to  restrict the installation of packages, allowing only  specified versions.

Review Comment:
   ```suggestion
   * **Use [constraint files](https://pip.pypa.io/en/stable/user_guide/#constraints-files).** You can use the constraint list to restrict the installation of packages, allowing only specified versions.
   ```



##########
website/www/site/content/en/documentation/sdks/python-pipeline-dependencies.md:
##########
@@ -160,3 +159,80 @@ Since serialization of the pipeline happens on the job submission, and deseriali
 To ensure this, Beam typically sets a very narrow supported version range for pickling libraries. If for whatever reason, users cannot use the version of `dill` or `cloudpickle` required by Beam, and choose to
 install a custom version, they must also ensure that they use the same custom version at runtime (e.g. in their custom container,
 or by specifying a pipeline dependency requirement).
+
+## Control the dependencies the pipeline uses {#control-dependencies}
+
+### Pipeline environments
+
+To run a Python pipeline on a remote runner, Apache Beam translates the pipeline into a [runner-independent representation](https://github.com/apache/beam/blob/master/model/pipeline/src/main/proto/org/apache/beam/model/pipeline/v1/beam_runner_api.proto) and submits it for execution. Translation happens in the **launch environment**. You can launch the pipeline from a Python virtual environment with installed Beam SDK, or with tools like [Dataflow Flex Templates](https://cloud.google.com/dataflow/docs/guides/templates/using-flex-templates), [Notebook environments](https://cloud.google.com/dataflow/docs/guides/interactive-pipeline-development), [Apache Airflow](https://airflow.apache.org/) and more.
+
+The [**runtime environment**](https://beam.apache.org/documentation/runtime/environments/) is the Python environment that a runner uses during pipeline execution. This environment is where the pipeline code runs to perform data  processing. The runtime environment includes Apache Beam and pipeline runtime dependencies.
+
+### Create reproducible environments {#create-reproducible-environments}
+
+You can use several tools to build reproducible Python environments:
+
+* **Use [requirements files](https://pip.pypa.io/en/stable/user_guide/#requirements-files).**  After you install dependencies, generate the requirements file by using `pip freeze > requirements.txt`. To recreate an environment, install dependencies from the requirements.txt file by using `pip install -r requirements.txt`.
+
+* **Use [constraint files](https://pip.pypa.io/en/stable/user_guide/#constraints-files).** You can use the constraint list to  restrict the installation of packages, allowing only  specified versions.
+
+* **Use lock files.** Use dependency management tools like [PipEnv](https://pipenv.pypa.io/en/latest/), [Poetry](https://python-poetry.org/), and [pip-tools](https://github.com/jazzband/pip-tools) to specify top-level dependencies, to generate lock files of all transitive dependencies with pinned versions, and to create virtual environments from these lockfiles.
+
+* **Use Docker container images.** You can package the launch and runtime environment inside a Docker container image. If the image includes all necessary dependencies, then the environment only changes when a container image is rebuilt.
+
+Use version control for the configuration files that define the environment.
+
+### Make the pipeline runtime environment reproducible
+
+When a pipeline uses a reproducible runtime environment on a remote runner, the workers on the runner use  the same dependencies each time the pipeline runs. A reproducible environment is immune to side-effects caused by releases of the pipeline's direct or transitive dependencies. It doesn’t require dependency resolution at runtime.
+
+You can create a reproducible runtime environment in the following ways:
+
+* Run your pipeline in a custom container image that has all dependencies for your pipeline. Use the `--sdk_container_image` pipeline option.
+
+* Supply an exhaustive list of the pipeline's dependencies in the `--requirements_file` pipeline option. Use the `--prebuild_sdk_container_engine` option to perform the runtime environment initialization sequence before the pipeline execution. If your dependencies don't change, reuse the prebuilt image by using the `--sdk_container_image` option.
+
+A self-contained runtime environment is usually reproducible. To check if the  runtime environment is self-contained, restrict internet access to PyPI in the pipeline runtime. If you use the Dataflow Runner, see the documentation for the [`--no_use_public_ips`](https://cloud.google.com/dataflow/docs/guides/routes-firewall#turn_off_external_ip_address) pipeline option.
+
+If you need to recreate or upgrade the runtime environment, do so in a controlled way with visibility into changed dependencies:
+
+* Do not modify container images when running pipelines are still using them.

Review Comment:
   ```suggestion
   * Do not modify container images when they in use by running pipelines.
   ```



##########
website/www/site/content/en/documentation/sdks/python-pipeline-dependencies.md:
##########
@@ -160,3 +159,80 @@ Since serialization of the pipeline happens on the job submission, and deseriali
 To ensure this, Beam typically sets a very narrow supported version range for pickling libraries. If for whatever reason, users cannot use the version of `dill` or `cloudpickle` required by Beam, and choose to
 install a custom version, they must also ensure that they use the same custom version at runtime (e.g. in their custom container,
 or by specifying a pipeline dependency requirement).
+
+## Control the dependencies the pipeline uses {#control-dependencies}
+
+### Pipeline environments
+
+To run a Python pipeline on a remote runner, Apache Beam translates the pipeline into a [runner-independent representation](https://github.com/apache/beam/blob/master/model/pipeline/src/main/proto/org/apache/beam/model/pipeline/v1/beam_runner_api.proto) and submits it for execution. Translation happens in the **launch environment**. You can launch the pipeline from a Python virtual environment with installed Beam SDK, or with tools like [Dataflow Flex Templates](https://cloud.google.com/dataflow/docs/guides/templates/using-flex-templates), [Notebook environments](https://cloud.google.com/dataflow/docs/guides/interactive-pipeline-development), [Apache Airflow](https://airflow.apache.org/) and more.
+
+The [**runtime environment**](https://beam.apache.org/documentation/runtime/environments/) is the Python environment that a runner uses during pipeline execution. This environment is where the pipeline code runs to perform data  processing. The runtime environment includes Apache Beam and pipeline runtime dependencies.

Review Comment:
   ```suggestion
   The [**runtime environment**](https://beam.apache.org/documentation/runtime/environments/) is the Python environment that a runner uses during pipeline execution. This environment is where the pipeline code runs to when it performs data  processing. The runtime environment includes Apache Beam and pipeline runtime dependencies.
   ```



##########
website/www/site/content/en/documentation/sdks/python-pipeline-dependencies.md:
##########
@@ -160,3 +159,80 @@ Since serialization of the pipeline happens on the job submission, and deseriali
 To ensure this, Beam typically sets a very narrow supported version range for pickling libraries. If for whatever reason, users cannot use the version of `dill` or `cloudpickle` required by Beam, and choose to
 install a custom version, they must also ensure that they use the same custom version at runtime (e.g. in their custom container,
 or by specifying a pipeline dependency requirement).
+
+## Control the dependencies the pipeline uses {#control-dependencies}
+
+### Pipeline environments
+
+To run a Python pipeline on a remote runner, Apache Beam translates the pipeline into a [runner-independent representation](https://github.com/apache/beam/blob/master/model/pipeline/src/main/proto/org/apache/beam/model/pipeline/v1/beam_runner_api.proto) and submits it for execution. Translation happens in the **launch environment**. You can launch the pipeline from a Python virtual environment with installed Beam SDK, or with tools like [Dataflow Flex Templates](https://cloud.google.com/dataflow/docs/guides/templates/using-flex-templates), [Notebook environments](https://cloud.google.com/dataflow/docs/guides/interactive-pipeline-development), [Apache Airflow](https://airflow.apache.org/) and more.
+
+The [**runtime environment**](https://beam.apache.org/documentation/runtime/environments/) is the Python environment that a runner uses during pipeline execution. This environment is where the pipeline code runs to perform data  processing. The runtime environment includes Apache Beam and pipeline runtime dependencies.
+
+### Create reproducible environments {#create-reproducible-environments}
+
+You can use several tools to build reproducible Python environments:
+
+* **Use [requirements files](https://pip.pypa.io/en/stable/user_guide/#requirements-files).**  After you install dependencies, generate the requirements file by using `pip freeze > requirements.txt`. To recreate an environment, install dependencies from the requirements.txt file by using `pip install -r requirements.txt`.
+
+* **Use [constraint files](https://pip.pypa.io/en/stable/user_guide/#constraints-files).** You can use the constraint list to  restrict the installation of packages, allowing only  specified versions.
+
+* **Use lock files.** Use dependency management tools like [PipEnv](https://pipenv.pypa.io/en/latest/), [Poetry](https://python-poetry.org/), and [pip-tools](https://github.com/jazzband/pip-tools) to specify top-level dependencies, to generate lock files of all transitive dependencies with pinned versions, and to create virtual environments from these lockfiles.
+
+* **Use Docker container images.** You can package the launch and runtime environment inside a Docker container image. If the image includes all necessary dependencies, then the environment only changes when a container image is rebuilt.
+
+Use version control for the configuration files that define the environment.
+
+### Make the pipeline runtime environment reproducible
+
+When a pipeline uses a reproducible runtime environment on a remote runner, the workers on the runner use  the same dependencies each time the pipeline runs. A reproducible environment is immune to side-effects caused by releases of the pipeline's direct or transitive dependencies. It doesn’t require dependency resolution at runtime.
+
+You can create a reproducible runtime environment in the following ways:
+
+* Run your pipeline in a custom container image that has all dependencies for your pipeline. Use the `--sdk_container_image` pipeline option.
+
+* Supply an exhaustive list of the pipeline's dependencies in the `--requirements_file` pipeline option. Use the `--prebuild_sdk_container_engine` option to perform the runtime environment initialization sequence before the pipeline execution. If your dependencies don't change, reuse the prebuilt image by using the `--sdk_container_image` option.
+
+A self-contained runtime environment is usually reproducible. To check if the  runtime environment is self-contained, restrict internet access to PyPI in the pipeline runtime. If you use the Dataflow Runner, see the documentation for the [`--no_use_public_ips`](https://cloud.google.com/dataflow/docs/guides/routes-firewall#turn_off_external_ip_address) pipeline option.
+
+If you need to recreate or upgrade the runtime environment, do so in a controlled way with visibility into changed dependencies:
+
+* Do not modify container images when running pipelines are still using them.
+
+* Avoid using the tag `:latest`  with your custom images. Tag your builds with a date or a unique identifier.  If something goes wrong, using this type of tag might make it possible to revert the pipeline execution to a previously known working configuration  and allow for an inspection of changes.
+
+* Consider storing the output of `pip freeze` or the contents of `requirements.txt` in the version control system.
+
+### Make the pipeline launch environment reproducible
+
+The launch environment runs the **production version** of the pipeline. While developing the pipeline locally, you might use a **development environment** that includes dependencies for development, such as Jupyter or Pylint. The launch environment for production pipelines might not need these additional dependencies. You can construct and maintain it separately from the dev environment.
+
+To reduce side-effects on pipeline submissions, it is best to able to [recreate launch environment in a reproducible manner](#create-reproducible-environments).
+
+[Dataflow Flex Templates](https://cloud.google.com/dataflow/docs/guides/templates/using-flex-templates) provide an example of a containerized, reproducible launch environment.
+
+To create reproducible installations of Beam into a clean virtual environment, use [requirements files](https://pip.pypa.io/en/stable/user_guide/#requirements-files) that list all Python dependencies included in Beam's default container images constraint files:
+
+```
+BEAM_VERSION=2.48.0
+PYTHON_VERSION=`python -c "import sys; print(f'{sys.version_info.major}{sys.version_info.minor}')"`
+pip install apache-beam==$BEAM_VERSION --constraint https://raw.githubusercontent.com/apache/beam/release-${BEAM_VERSION}/sdks/python/container/py${PY_VERSION}/base_image_requirements.txt
+```
+
+Use a constraint file to ensure that Beam dependencies in the launch environment match the versions in default Beam containers. A constraint file might also remove the need for dependency resolution at installation time.
+
+### Make the launch environment compatible with the runtime environment
+
+The launch environment translates the  pipeline graph into a [runner-independent representation](https://github.com/apache/beam/blob/master/model/pipeline/src/main/proto/org/apache/beam/model/pipeline/v1/beam_runner_api.proto). This process involves serializing (or pickling) the code of the transforms. The serialized content is deserialized on the workers. If the runtime worker environment significantly differs from the launch environment, runtime errors might occur for the following reasons:
+
+* Versions of `protobuf` in the submission and runtime environment need to match or be compatible.
+The Apache Beam version and the Python major.minor versions between submission and runtime environment must match. Otherwise, the pipeline might fail with errors like "Pipeline construction environment and pipeline runtime environment are not compatible." On older SDK versions, the error might be reported as "SystemError: unknown opcode".

Review Comment:
   ```suggestion
   The Apache Beam version and the Python major.minor versions must match in the submission and runtime environments. Otherwise, the pipeline might fail with errors like `Pipeline construction environment and pipeline runtime environment are not compatible`. On older SDK versions, the error might be reported as `SystemError: unknown opcode`.
   ```



##########
website/www/site/content/en/documentation/sdks/python-pipeline-dependencies.md:
##########
@@ -160,3 +159,80 @@ Since serialization of the pipeline happens on the job submission, and deseriali
 To ensure this, Beam typically sets a very narrow supported version range for pickling libraries. If for whatever reason, users cannot use the version of `dill` or `cloudpickle` required by Beam, and choose to
 install a custom version, they must also ensure that they use the same custom version at runtime (e.g. in their custom container,
 or by specifying a pipeline dependency requirement).
+
+## Control the dependencies the pipeline uses {#control-dependencies}
+
+### Pipeline environments
+
+To run a Python pipeline on a remote runner, Apache Beam translates the pipeline into a [runner-independent representation](https://github.com/apache/beam/blob/master/model/pipeline/src/main/proto/org/apache/beam/model/pipeline/v1/beam_runner_api.proto) and submits it for execution. Translation happens in the **launch environment**. You can launch the pipeline from a Python virtual environment with installed Beam SDK, or with tools like [Dataflow Flex Templates](https://cloud.google.com/dataflow/docs/guides/templates/using-flex-templates), [Notebook environments](https://cloud.google.com/dataflow/docs/guides/interactive-pipeline-development), [Apache Airflow](https://airflow.apache.org/) and more.
+
+The [**runtime environment**](https://beam.apache.org/documentation/runtime/environments/) is the Python environment that a runner uses during pipeline execution. This environment is where the pipeline code runs to perform data  processing. The runtime environment includes Apache Beam and pipeline runtime dependencies.
+
+### Create reproducible environments {#create-reproducible-environments}
+
+You can use several tools to build reproducible Python environments:
+
+* **Use [requirements files](https://pip.pypa.io/en/stable/user_guide/#requirements-files).**  After you install dependencies, generate the requirements file by using `pip freeze > requirements.txt`. To recreate an environment, install dependencies from the requirements.txt file by using `pip install -r requirements.txt`.
+
+* **Use [constraint files](https://pip.pypa.io/en/stable/user_guide/#constraints-files).** You can use the constraint list to  restrict the installation of packages, allowing only  specified versions.
+
+* **Use lock files.** Use dependency management tools like [PipEnv](https://pipenv.pypa.io/en/latest/), [Poetry](https://python-poetry.org/), and [pip-tools](https://github.com/jazzband/pip-tools) to specify top-level dependencies, to generate lock files of all transitive dependencies with pinned versions, and to create virtual environments from these lockfiles.
+
+* **Use Docker container images.** You can package the launch and runtime environment inside a Docker container image. If the image includes all necessary dependencies, then the environment only changes when a container image is rebuilt.
+
+Use version control for the configuration files that define the environment.
+
+### Make the pipeline runtime environment reproducible
+
+When a pipeline uses a reproducible runtime environment on a remote runner, the workers on the runner use  the same dependencies each time the pipeline runs. A reproducible environment is immune to side-effects caused by releases of the pipeline's direct or transitive dependencies. It doesn’t require dependency resolution at runtime.
+
+You can create a reproducible runtime environment in the following ways:
+
+* Run your pipeline in a custom container image that has all dependencies for your pipeline. Use the `--sdk_container_image` pipeline option.
+
+* Supply an exhaustive list of the pipeline's dependencies in the `--requirements_file` pipeline option. Use the `--prebuild_sdk_container_engine` option to perform the runtime environment initialization sequence before the pipeline execution. If your dependencies don't change, reuse the prebuilt image by using the `--sdk_container_image` option.
+
+A self-contained runtime environment is usually reproducible. To check if the  runtime environment is self-contained, restrict internet access to PyPI in the pipeline runtime. If you use the Dataflow Runner, see the documentation for the [`--no_use_public_ips`](https://cloud.google.com/dataflow/docs/guides/routes-firewall#turn_off_external_ip_address) pipeline option.
+
+If you need to recreate or upgrade the runtime environment, do so in a controlled way with visibility into changed dependencies:
+
+* Do not modify container images when running pipelines are still using them.
+
+* Avoid using the tag `:latest`  with your custom images. Tag your builds with a date or a unique identifier.  If something goes wrong, using this type of tag might make it possible to revert the pipeline execution to a previously known working configuration  and allow for an inspection of changes.
+
+* Consider storing the output of `pip freeze` or the contents of `requirements.txt` in the version control system.
+
+### Make the pipeline launch environment reproducible
+
+The launch environment runs the **production version** of the pipeline. While developing the pipeline locally, you might use a **development environment** that includes dependencies for development, such as Jupyter or Pylint. The launch environment for production pipelines might not need these additional dependencies. You can construct and maintain it separately from the dev environment.
+
+To reduce side-effects on pipeline submissions, it is best to able to [recreate launch environment in a reproducible manner](#create-reproducible-environments).
+
+[Dataflow Flex Templates](https://cloud.google.com/dataflow/docs/guides/templates/using-flex-templates) provide an example of a containerized, reproducible launch environment.
+
+To create reproducible installations of Beam into a clean virtual environment, use [requirements files](https://pip.pypa.io/en/stable/user_guide/#requirements-files) that list all Python dependencies included in Beam's default container images constraint files:
+
+```
+BEAM_VERSION=2.48.0
+PYTHON_VERSION=`python -c "import sys; print(f'{sys.version_info.major}{sys.version_info.minor}')"`
+pip install apache-beam==$BEAM_VERSION --constraint https://raw.githubusercontent.com/apache/beam/release-${BEAM_VERSION}/sdks/python/container/py${PY_VERSION}/base_image_requirements.txt
+```
+
+Use a constraint file to ensure that Beam dependencies in the launch environment match the versions in default Beam containers. A constraint file might also remove the need for dependency resolution at installation time.
+
+### Make the launch environment compatible with the runtime environment
+
+The launch environment translates the  pipeline graph into a [runner-independent representation](https://github.com/apache/beam/blob/master/model/pipeline/src/main/proto/org/apache/beam/model/pipeline/v1/beam_runner_api.proto). This process involves serializing (or pickling) the code of the transforms. The serialized content is deserialized on the workers. If the runtime worker environment significantly differs from the launch environment, runtime errors might occur for the following reasons:
+
+* Versions of `protobuf` in the submission and runtime environment need to match or be compatible.
+The Apache Beam version and the Python major.minor versions between submission and runtime environment must match. Otherwise, the pipeline might fail with errors like "Pipeline construction environment and pipeline runtime environment are not compatible." On older SDK versions, the error might be reported as "SystemError: unknown opcode".
+
+* Libraries used in the pipeline code might need to match. If serialized pipeline code has references to functions or modules that aren’t available on the workers, the pipeline might fail with ModuleNotFound or AttributeError exceptions on the remote runner. If you encounter such errors, make sure that the affected libraries are available on the remote worker, and check whether you need to [save the main session](  https://beam.apache.org/documentation/sdks/python-pipeline-dependencies/#pickling-and-managing-the-main-session).

Review Comment:
   ```suggestion
   * Libraries used in the pipeline code might need to match. If serialized pipeline code has references to functions or modules that aren’t available on the workers, the pipeline might fail with `ModuleNotFound` or `AttributeError` exceptions on the remote runner. If you encounter such errors, make sure that the affected libraries are available on the remote worker, and check whether you need to [save the main session](  https://beam.apache.org/documentation/sdks/python-pipeline-dependencies/#pickling-and-managing-the-main-session).
   ```



##########
website/www/site/content/en/documentation/sdks/python-pipeline-dependencies.md:
##########
@@ -160,3 +159,80 @@ Since serialization of the pipeline happens on the job submission, and deseriali
 To ensure this, Beam typically sets a very narrow supported version range for pickling libraries. If for whatever reason, users cannot use the version of `dill` or `cloudpickle` required by Beam, and choose to
 install a custom version, they must also ensure that they use the same custom version at runtime (e.g. in their custom container,
 or by specifying a pipeline dependency requirement).
+
+## Control the dependencies the pipeline uses {#control-dependencies}
+
+### Pipeline environments
+
+To run a Python pipeline on a remote runner, Apache Beam translates the pipeline into a [runner-independent representation](https://github.com/apache/beam/blob/master/model/pipeline/src/main/proto/org/apache/beam/model/pipeline/v1/beam_runner_api.proto) and submits it for execution. Translation happens in the **launch environment**. You can launch the pipeline from a Python virtual environment with installed Beam SDK, or with tools like [Dataflow Flex Templates](https://cloud.google.com/dataflow/docs/guides/templates/using-flex-templates), [Notebook environments](https://cloud.google.com/dataflow/docs/guides/interactive-pipeline-development), [Apache Airflow](https://airflow.apache.org/) and more.
+
+The [**runtime environment**](https://beam.apache.org/documentation/runtime/environments/) is the Python environment that a runner uses during pipeline execution. This environment is where the pipeline code runs to perform data  processing. The runtime environment includes Apache Beam and pipeline runtime dependencies.
+
+### Create reproducible environments {#create-reproducible-environments}
+
+You can use several tools to build reproducible Python environments:
+
+* **Use [requirements files](https://pip.pypa.io/en/stable/user_guide/#requirements-files).**  After you install dependencies, generate the requirements file by using `pip freeze > requirements.txt`. To recreate an environment, install dependencies from the requirements.txt file by using `pip install -r requirements.txt`.
+
+* **Use [constraint files](https://pip.pypa.io/en/stable/user_guide/#constraints-files).** You can use the constraint list to  restrict the installation of packages, allowing only  specified versions.
+
+* **Use lock files.** Use dependency management tools like [PipEnv](https://pipenv.pypa.io/en/latest/), [Poetry](https://python-poetry.org/), and [pip-tools](https://github.com/jazzband/pip-tools) to specify top-level dependencies, to generate lock files of all transitive dependencies with pinned versions, and to create virtual environments from these lockfiles.
+
+* **Use Docker container images.** You can package the launch and runtime environment inside a Docker container image. If the image includes all necessary dependencies, then the environment only changes when a container image is rebuilt.
+
+Use version control for the configuration files that define the environment.
+
+### Make the pipeline runtime environment reproducible
+
+When a pipeline uses a reproducible runtime environment on a remote runner, the workers on the runner use  the same dependencies each time the pipeline runs. A reproducible environment is immune to side-effects caused by releases of the pipeline's direct or transitive dependencies. It doesn’t require dependency resolution at runtime.
+
+You can create a reproducible runtime environment in the following ways:
+
+* Run your pipeline in a custom container image that has all dependencies for your pipeline. Use the `--sdk_container_image` pipeline option.
+
+* Supply an exhaustive list of the pipeline's dependencies in the `--requirements_file` pipeline option. Use the `--prebuild_sdk_container_engine` option to perform the runtime environment initialization sequence before the pipeline execution. If your dependencies don't change, reuse the prebuilt image by using the `--sdk_container_image` option.
+
+A self-contained runtime environment is usually reproducible. To check if the  runtime environment is self-contained, restrict internet access to PyPI in the pipeline runtime. If you use the Dataflow Runner, see the documentation for the [`--no_use_public_ips`](https://cloud.google.com/dataflow/docs/guides/routes-firewall#turn_off_external_ip_address) pipeline option.
+
+If you need to recreate or upgrade the runtime environment, do so in a controlled way with visibility into changed dependencies:
+
+* Do not modify container images when running pipelines are still using them.
+
+* Avoid using the tag `:latest`  with your custom images. Tag your builds with a date or a unique identifier.  If something goes wrong, using this type of tag might make it possible to revert the pipeline execution to a previously known working configuration  and allow for an inspection of changes.
+
+* Consider storing the output of `pip freeze` or the contents of `requirements.txt` in the version control system.
+
+### Make the pipeline launch environment reproducible
+
+The launch environment runs the **production version** of the pipeline. While developing the pipeline locally, you might use a **development environment** that includes dependencies for development, such as Jupyter or Pylint. The launch environment for production pipelines might not need these additional dependencies. You can construct and maintain it separately from the dev environment.
+
+To reduce side-effects on pipeline submissions, it is best to able to [recreate launch environment in a reproducible manner](#create-reproducible-environments).
+
+[Dataflow Flex Templates](https://cloud.google.com/dataflow/docs/guides/templates/using-flex-templates) provide an example of a containerized, reproducible launch environment.
+
+To create reproducible installations of Beam into a clean virtual environment, use [requirements files](https://pip.pypa.io/en/stable/user_guide/#requirements-files) that list all Python dependencies included in Beam's default container images constraint files:
+
+```
+BEAM_VERSION=2.48.0
+PYTHON_VERSION=`python -c "import sys; print(f'{sys.version_info.major}{sys.version_info.minor}')"`
+pip install apache-beam==$BEAM_VERSION --constraint https://raw.githubusercontent.com/apache/beam/release-${BEAM_VERSION}/sdks/python/container/py${PY_VERSION}/base_image_requirements.txt
+```
+
+Use a constraint file to ensure that Beam dependencies in the launch environment match the versions in default Beam containers. A constraint file might also remove the need for dependency resolution at installation time.
+
+### Make the launch environment compatible with the runtime environment
+
+The launch environment translates the  pipeline graph into a [runner-independent representation](https://github.com/apache/beam/blob/master/model/pipeline/src/main/proto/org/apache/beam/model/pipeline/v1/beam_runner_api.proto). This process involves serializing (or pickling) the code of the transforms. The serialized content is deserialized on the workers. If the runtime worker environment significantly differs from the launch environment, runtime errors might occur for the following reasons:
+
+* Versions of `protobuf` in the submission and runtime environment need to match or be compatible.
+The Apache Beam version and the Python major.minor versions between submission and runtime environment must match. Otherwise, the pipeline might fail with errors like "Pipeline construction environment and pipeline runtime environment are not compatible." On older SDK versions, the error might be reported as "SystemError: unknown opcode".
+
+* Libraries used in the pipeline code might need to match. If serialized pipeline code has references to functions or modules that aren’t available on the workers, the pipeline might fail with ModuleNotFound or AttributeError exceptions on the remote runner. If you encounter such errors, make sure that the affected libraries are available on the remote worker, and check whether you need to [save the main session](  https://beam.apache.org/documentation/sdks/python-pipeline-dependencies/#pickling-and-managing-the-main-session).
+
+* The version of the pickling library used at submission time must match the version installed at runtime. To enforce this, Beam sets a tight bounds on the version of serializer libraries (dill and cloudpickle). You can force install a different version of `dill` or `cloudpickle` than required by Beam under the following conditions:
+  * You install the same version in submission and in the runtime environment.
+  * The chosen version works for your pipeline.
+
+To check whether the runtime environment matches the launch environment, inspect differences in the `pip freeze` output in both environments. Update to the latest version of Beam, because environment compatibility checks are included in newer SDK versions.
+
+Finally, you can use the same environment by launching the pipeline from the  containerized environment that you use at runtime. [Dataflow Flex templates built from a custom container image](https://cloud.google.com/dataflow/docs/guides/templates/configuring-flex-templates#use_custom_container_images) offer this setup. In this scenario, you can recreate both launch and runtime  environments in a reproducible manner. Because both containers are created from the same image, the launch and runtime environments are compatible with each other by default.

Review Comment:
   ```suggestion
   Finally, you can use the same environment by launching the pipeline from the  containerized environment that you use at runtime. [Dataflow Flex templates built from a custom container image](https://cloud.google.com/dataflow/docs/guides/templates/configuring-flex-templates#use_custom_container_images) offer this setup. In this scenario, you can recreate both launch and runtime environments in a reproducible manner. Because both containers are created from the same image, the launch and runtime environments are compatible with each other by default.
   ```



##########
website/www/site/content/en/documentation/sdks/python-pipeline-dependencies.md:
##########
@@ -160,3 +159,80 @@ Since serialization of the pipeline happens on the job submission, and deseriali
 To ensure this, Beam typically sets a very narrow supported version range for pickling libraries. If for whatever reason, users cannot use the version of `dill` or `cloudpickle` required by Beam, and choose to
 install a custom version, they must also ensure that they use the same custom version at runtime (e.g. in their custom container,
 or by specifying a pipeline dependency requirement).
+
+## Control the dependencies the pipeline uses {#control-dependencies}
+
+### Pipeline environments
+
+To run a Python pipeline on a remote runner, Apache Beam translates the pipeline into a [runner-independent representation](https://github.com/apache/beam/blob/master/model/pipeline/src/main/proto/org/apache/beam/model/pipeline/v1/beam_runner_api.proto) and submits it for execution. Translation happens in the **launch environment**. You can launch the pipeline from a Python virtual environment with installed Beam SDK, or with tools like [Dataflow Flex Templates](https://cloud.google.com/dataflow/docs/guides/templates/using-flex-templates), [Notebook environments](https://cloud.google.com/dataflow/docs/guides/interactive-pipeline-development), [Apache Airflow](https://airflow.apache.org/) and more.
+
+The [**runtime environment**](https://beam.apache.org/documentation/runtime/environments/) is the Python environment that a runner uses during pipeline execution. This environment is where the pipeline code runs to perform data  processing. The runtime environment includes Apache Beam and pipeline runtime dependencies.
+
+### Create reproducible environments {#create-reproducible-environments}
+
+You can use several tools to build reproducible Python environments:
+
+* **Use [requirements files](https://pip.pypa.io/en/stable/user_guide/#requirements-files).**  After you install dependencies, generate the requirements file by using `pip freeze > requirements.txt`. To recreate an environment, install dependencies from the requirements.txt file by using `pip install -r requirements.txt`.
+
+* **Use [constraint files](https://pip.pypa.io/en/stable/user_guide/#constraints-files).** You can use the constraint list to  restrict the installation of packages, allowing only  specified versions.
+
+* **Use lock files.** Use dependency management tools like [PipEnv](https://pipenv.pypa.io/en/latest/), [Poetry](https://python-poetry.org/), and [pip-tools](https://github.com/jazzband/pip-tools) to specify top-level dependencies, to generate lock files of all transitive dependencies with pinned versions, and to create virtual environments from these lockfiles.
+
+* **Use Docker container images.** You can package the launch and runtime environment inside a Docker container image. If the image includes all necessary dependencies, then the environment only changes when a container image is rebuilt.
+
+Use version control for the configuration files that define the environment.
+
+### Make the pipeline runtime environment reproducible
+
+When a pipeline uses a reproducible runtime environment on a remote runner, the workers on the runner use  the same dependencies each time the pipeline runs. A reproducible environment is immune to side-effects caused by releases of the pipeline's direct or transitive dependencies. It doesn’t require dependency resolution at runtime.
+
+You can create a reproducible runtime environment in the following ways:
+
+* Run your pipeline in a custom container image that has all dependencies for your pipeline. Use the `--sdk_container_image` pipeline option.
+
+* Supply an exhaustive list of the pipeline's dependencies in the `--requirements_file` pipeline option. Use the `--prebuild_sdk_container_engine` option to perform the runtime environment initialization sequence before the pipeline execution. If your dependencies don't change, reuse the prebuilt image by using the `--sdk_container_image` option.
+
+A self-contained runtime environment is usually reproducible. To check if the  runtime environment is self-contained, restrict internet access to PyPI in the pipeline runtime. If you use the Dataflow Runner, see the documentation for the [`--no_use_public_ips`](https://cloud.google.com/dataflow/docs/guides/routes-firewall#turn_off_external_ip_address) pipeline option.
+
+If you need to recreate or upgrade the runtime environment, do so in a controlled way with visibility into changed dependencies:
+
+* Do not modify container images when running pipelines are still using them.
+
+* Avoid using the tag `:latest`  with your custom images. Tag your builds with a date or a unique identifier.  If something goes wrong, using this type of tag might make it possible to revert the pipeline execution to a previously known working configuration  and allow for an inspection of changes.
+
+* Consider storing the output of `pip freeze` or the contents of `requirements.txt` in the version control system.
+
+### Make the pipeline launch environment reproducible
+
+The launch environment runs the **production version** of the pipeline. While developing the pipeline locally, you might use a **development environment** that includes dependencies for development, such as Jupyter or Pylint. The launch environment for production pipelines might not need these additional dependencies. You can construct and maintain it separately from the dev environment.

Review Comment:
   ```suggestion
   The launch environment runs the **production version** of the pipeline. While developing the pipeline locally, you might use a **development environment** that includes dependencies for development, such as Jupyter or Pylint. The launch environment for production pipelines might not need these additional dependencies. You can construct and maintain it separately from the development environment.
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [beam] rszper commented on a diff in pull request #27749: Add the guidance on controlling pipeline dependencies.

Posted by "rszper (via GitHub)" <gi...@apache.org>.

rszper commented on code in PR #27749:
URL: https://github.com/apache/beam/pull/27749#discussion_r1278182441


##########
website/www/site/content/en/documentation/sdks/python-pipeline-dependencies.md:
##########
@@ -160,3 +159,80 @@ Since serialization of the pipeline happens on the job submission, and deseriali
 To ensure this, Beam typically sets a very narrow supported version range for pickling libraries. If for whatever reason, users cannot use the version of `dill` or `cloudpickle` required by Beam, and choose to
 install a custom version, they must also ensure that they use the same custom version at runtime (e.g. in their custom container,
 or by specifying a pipeline dependency requirement).
+
+## Control the dependencies the pipeline uses {#control-dependencies}
+
+### Pipeline environments
+
+To run a Python pipeline on a remote runner, Apache Beam translates the pipeline into a [runner-independent representation](https://github.com/apache/beam/blob/master/model/pipeline/src/main/proto/org/apache/beam/model/pipeline/v1/beam_runner_api.proto) and submits it for execution. Translation happens in the **launch environment**. You can launch the pipeline from a Python virtual environment with installed Beam SDK, or with tools like [Dataflow Flex Templates](https://cloud.google.com/dataflow/docs/guides/templates/using-flex-templates), [Notebook environments](https://cloud.google.com/dataflow/docs/guides/interactive-pipeline-development), [Apache Airflow](https://airflow.apache.org/) and more.
+
+The [**runtime environment**](https://beam.apache.org/documentation/runtime/environments/) is the Python environment that a runner uses during pipeline execution. This environment is where the pipeline code runs to perform data  processing. The runtime environment includes Apache Beam and pipeline runtime dependencies.
+
+### Create reproducible environments {#create-reproducible-environments}
+
+You can use several tools to build reproducible Python environments:
+
+* **Use [requirements files](https://pip.pypa.io/en/stable/user_guide/#requirements-files).**  After you install dependencies, generate the requirements file by using `pip freeze > requirements.txt`. To recreate an environment, install dependencies from the requirements.txt file by using `pip install -r requirements.txt`.
+
+* **Use [constraint files](https://pip.pypa.io/en/stable/user_guide/#constraints-files).** You can use the constraint list to  restrict the installation of packages, allowing only  specified versions.
+
+* **Use lock files.** Use dependency management tools like [PipEnv](https://pipenv.pypa.io/en/latest/), [Poetry](https://python-poetry.org/), and [pip-tools](https://github.com/jazzband/pip-tools) to specify top-level dependencies, to generate lock files of all transitive dependencies with pinned versions, and to create virtual environments from these lockfiles.
+
+* **Use Docker container images.** You can package the launch and runtime environment inside a Docker container image. If the image includes all necessary dependencies, then the environment only changes when a container image is rebuilt.
+
+Use version control for the configuration files that define the environment.
+
+### Make the pipeline runtime environment reproducible
+
+When a pipeline uses a reproducible runtime environment on a remote runner, the workers on the runner use  the same dependencies each time the pipeline runs. A reproducible environment is immune to side-effects caused by releases of the pipeline's direct or transitive dependencies. It doesn’t require dependency resolution at runtime.
+
+You can create a reproducible runtime environment in the following ways:
+
+* Run your pipeline in a custom container image that has all dependencies for your pipeline. Use the `--sdk_container_image` pipeline option.
+
+* Supply an exhaustive list of the pipeline's dependencies in the `--requirements_file` pipeline option. Use the `--prebuild_sdk_container_engine` option to perform the runtime environment initialization sequence before the pipeline execution. If your dependencies don't change, reuse the prebuilt image by using the `--sdk_container_image` option.
+
+A self-contained runtime environment is usually reproducible. To check if the  runtime environment is self-contained, restrict internet access to PyPI in the pipeline runtime. If you use the Dataflow Runner, see the documentation for the [`--no_use_public_ips`](https://cloud.google.com/dataflow/docs/guides/routes-firewall#turn_off_external_ip_address) pipeline option.
+
+If you need to recreate or upgrade the runtime environment, do so in a controlled way with visibility into changed dependencies:
+
+* Do not modify container images when running pipelines are still using them.

Review Comment:
   ```suggestion
   * Do not modify container images when they are in use by running pipelines.
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [beam] tvalentyn commented on pull request #27749: Add the guidance on controlling pipeline dependencies.

Posted by "tvalentyn (via GitHub)" <gi...@apache.org>.

tvalentyn commented on PR #27749:
URL: https://github.com/apache/beam/pull/27749#issuecomment-1656448917

   R: @AnandInguva 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org