Posted to issues@beam.apache.org by "ASF GitHub Bot (Jira)" <ji...@apache.org> on 2022/03/04 21:02:00 UTC

[jira] [Work logged] (BEAM-13314) Revise recommendations to manage Python pipeline dependencies.

     [ https://issues.apache.org/jira/browse/BEAM-13314?focusedWorklogId=736899&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-736899 ]

ASF GitHub Bot logged work on BEAM-13314:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 04/Mar/22 21:01
            Start Date: 04/Mar/22 21:01
    Worklog Time Spent: 10m 
      Work Description: tvalentyn commented on a change in pull request #16938:
URL: https://github.com/apache/beam/pull/16938#discussion_r819900366



##########
File path: website/www/site/content/en/documentation/runtime/environments.md
##########
@@ -171,6 +171,49 @@ creates a Java 8 SDK image with appropriate licenses in `/opt/apache/beam/third_
 
 By default, no licenses/notices are added to the docker images.
 
+#### Build an existing container image to make it compatible with Apache Beam Runners {#modify-existing-base-image}
+Beam offers a way to take a Beam container image and customize it. But if you have an existing base image to be compatible with Apache Beam Runners, use a [multi-stage build](https://docs.docker.com/develop/develop-images/multistage-build/) process to copy over the necessary artifacts from a default Apache Beam base image and provide your custom container image.
+
+
+1. Copy necessary artifacts from Apache Beam base image to your image.
+  ```
+  # This can be any container image,
+ FROM python:3.8-slim

Review comment:
       mismatch between py3.8 and py3.7 below
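   
   A minimal sketch with matching versions, assuming Python 3.8 on both sides. The Beam release tag `2.36.0` is an illustrative assumption, and the copied path and entrypoint follow the layout of released Beam SDK images rather than anything visible in this hunk:
   
   ```dockerfile
   # Any Python 3.8 base image; the minor version must match the Beam SDK image below.
   FROM python:3.8-slim
   
   # Install the Beam SDK at the same release as the image being copied from (assumed tag).
   RUN pip install --no-cache-dir apache-beam==2.36.0
   
   # Copy the Beam worker artifacts from the Python 3.8 SDK image, not the 3.7 one.
   COPY --from=apache/beam_python3.8_sdk:2.36.0 /opt/apache/beam /opt/apache/beam
   
   # Start the Beam boot program when the container launches.
   ENTRYPOINT ["/opt/apache/beam/boot"]
   ```
   
   Keeping `python:3.8` and `beam_python3.8_sdk` in step avoids copying artifacts built for a different interpreter version.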

##########
File path: website/www/site/content/en/documentation/sdks/python-pipeline-dependencies.md
##########
@@ -45,6 +45,19 @@ If your pipeline uses public packages from the [Python Package Index](https://py
     The runner will use the `requirements.txt` file to install your additional dependencies onto the remote workers.
 
 **Important:** Remote workers will install all packages listed in the `requirements.txt` file. Because of this, it's very important that you delete non-PyPI packages from the `requirements.txt` file, as stated in step 2. If you don't remove non-PyPI packages, the remote workers will fail when attempting to install packages from sources that are unknown to them.
+> **NOTE**: An alternative to `pip check` is to use a library like [pip-tools](https://github.com/jazzband/pip-tools) to compile the `requirements.txt` with all the dependencies required for the pipeline.
+## Custom Containers {#custom-containers}
+
+You can pass a [container](https://hub.docker.com/search?q=apache%2Fbeam&type=image) image with all the dependencies that are needed for the pipeline instead of `requirements.txt`. [Follow the instructions on how to run pipeline with Custom Container images](https://beam.apache.org/documentation/runtime/environments/#running-pipelines).
+
+1. If you are passing a custom container image, `--sdk_container_image` at runtime and specify `--requirements_file` option, we recommend you to install the dependencies from the `--requirements_file` when building your container image. In this case, you would reduce the pipeline startup time and do not need to pass `--requirements_file` option at runtime.
+
+       # Add these lines with the path to the requirements.txt to the Dockerfile
+
+       COPY <path to requirements.txt> /tmp/requirements.txt
+       RUN python -m pip download -r /tmp/requirements.txt
+
+**Note:** [Different approaches](https://beam.apache.org/documentation/runtime/environments/#writing-new-dockerfiles) to build the container images that would be compatible with Apache Beam Runners.

Review comment:
       I don't think this is relevant here

##########
File path: website/www/site/content/en/documentation/sdks/python-pipeline-dependencies.md
##########
@@ -45,6 +45,19 @@ If your pipeline uses public packages from the [Python Package Index](https://py
     The runner will use the `requirements.txt` file to install your additional dependencies onto the remote workers.
 
 **Important:** Remote workers will install all packages listed in the `requirements.txt` file. Because of this, it's very important that you delete non-PyPI packages from the `requirements.txt` file, as stated in step 2. If you don't remove non-PyPI packages, the remote workers will fail when attempting to install packages from sources that are unknown to them.
+> **NOTE**: An alternative to `pip check` is to use a library like [pip-tools](https://github.com/jazzband/pip-tools) to compile the `requirements.txt` with all the dependencies required for the pipeline.
+## Custom Containers {#custom-containers}
+
+You can pass a [container](https://hub.docker.com/search?q=apache%2Fbeam&type=image) image with all the dependencies that are needed for the pipeline instead of `requirements.txt`. [Follow the instructions on how to run pipeline with Custom Container images](https://beam.apache.org/documentation/runtime/environments/#running-pipelines).
+
+1. If you are passing a custom container image, `--sdk_container_image` at runtime and specify `--requirements_file` option, we recommend you to install the dependencies from the `--requirements_file` when building your container image. In this case, you would reduce the pipeline startup time and do not need to pass `--requirements_file` option at runtime.
+
+       # Add these lines with the path to the requirements.txt to the Dockerfile
+
+       COPY <path to requirements.txt> /tmp/requirements.txt
+       RUN python -m pip download -r /tmp/requirements.txt

Review comment:
       why pip download and not pip install ?
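   
   One possible `pip install` variant, sketched under assumptions: the base image tag is illustrative, and the `<path to requirements.txt>` placeholder is kept from the diff, so it must be replaced with a real path before building:
   
   ```dockerfile
   # Assumed base image; any Beam-compatible Python image would do.
   FROM apache/beam_python3.8_sdk:2.36.0
   
   # Placeholder path carried over from the diff; substitute the real location.
   COPY <path to requirements.txt> /tmp/requirements.txt
   
   # Install the dependencies into the image at build time.
   RUN python -m pip install --no-cache-dir -r /tmp/requirements.txt
   ```
   
   `pip install` puts the packages into the image's environment, whereas `pip download` only fetches the archives into a directory without installing them.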

##########
File path: website/www/site/content/en/documentation/runtime/environments.md
##########
@@ -171,6 +171,49 @@ creates a Java 8 SDK image with appropriate licenses in `/opt/apache/beam/third_
 
 By default, no licenses/notices are added to the docker images.
 
+#### Build an existing container image to make it compatible with Apache Beam Runners {#modify-existing-base-image}

Review comment:
       @emilymye could you PTAL at this section?

##########
File path: website/www/site/content/en/documentation/sdks/python-pipeline-dependencies.md
##########
@@ -123,3 +136,19 @@ If your pipeline uses non-Python packages (e.g. packages that require installati
         --setup_file /path/to/setup.py
 
 **Note:** Because custom commands execute after the dependencies for your workflow are installed (by `pip`), you should omit the PyPI package dependency from the pipeline's `requirements.txt` file and from the `install_requires` parameter in the `setuptools.setup()` call of your `setup.py` file.
+
+## Pre-building SDK container image

Review comment:
       @y1chi could you PTAL at this section?

##########
File path: website/www/site/content/en/documentation/runtime/environments.md
##########
@@ -46,7 +46,7 @@ Beam [SDK container images](https://hub.docker.com/search?q=apache%2Fbeam&type=i
 
 1. **[Writing a new](#writing-new-dockerfiles) Dockerfile based on a released container image**. This is sufficient for simple additions to the image, such as adding artifacts or environment variables.
 2. **[Modifying](#modifying-dockerfiles) a source Dockerfile in [Beam](https://github.com/apache/beam)**. This method requires building from Beam source but allows for greater customization of the container (including replacement of artifacts or base OS/language versions).
-
+3. **[Build](#modify-existing-base-image) an existing container image to make it compatible with Apache Beam Runners**. This method is used when users start from an existing image, and configure the image to be compatible with Apache Beam Runners.

Review comment:
       ```suggestion
   3. **[Modifying](#modify-existing-base-image) an existing container image to make it compatible with Apache Beam Runners**. This method is used when users start from an existing image, and configure the image to be compatible with Apache Beam Runners.
   ```
   
   Also: the text above should now say "one of three ways".

##########
File path: website/www/site/content/en/documentation/sdks/python-pipeline-dependencies.md
##########
@@ -45,6 +45,19 @@ If your pipeline uses public packages from the [Python Package Index](https://py
     The runner will use the `requirements.txt` file to install your additional dependencies onto the remote workers.
 
 **Important:** Remote workers will install all packages listed in the `requirements.txt` file. Because of this, it's very important that you delete non-PyPI packages from the `requirements.txt` file, as stated in step 2. If you don't remove non-PyPI packages, the remote workers will fail when attempting to install packages from sources that are unknown to them.
+> **NOTE**: An alternative to `pip check` is to use a library like [pip-tools](https://github.com/jazzband/pip-tools) to compile the `requirements.txt` with all the dependencies required for the pipeline.

Review comment:
       `pip freeze`, not `pip check`
   
   you can explain:
   
   "...to compile the `requirements.txt` file with all transitive dependencies from a smaller set of requirements."

##########
File path: website/www/site/content/en/documentation/sdks/python-pipeline-dependencies.md
##########
@@ -45,6 +45,19 @@ If your pipeline uses public packages from the [Python Package Index](https://py
     The runner will use the `requirements.txt` file to install your additional dependencies onto the remote workers.
 
 **Important:** Remote workers will install all packages listed in the `requirements.txt` file. Because of this, it's very important that you delete non-PyPI packages from the `requirements.txt` file, as stated in step 2. If you don't remove non-PyPI packages, the remote workers will fail when attempting to install packages from sources that are unknown to them.
+> **NOTE**: An alternative to `pip check` is to use a library like [pip-tools](https://github.com/jazzband/pip-tools) to compile the `requirements.txt` with all the dependencies required for the pipeline.
+## Custom Containers {#custom-containers}
+
+You can pass a [container](https://hub.docker.com/search?q=apache%2Fbeam&type=image) image with all the dependencies that are needed for the pipeline instead of `requirements.txt`. [Follow the instructions on how to run pipeline with Custom Container images](https://beam.apache.org/documentation/runtime/environments/#running-pipelines).
+
+1. If you are passing a custom container image, `--sdk_container_image` at runtime and specify `--requirements_file` option, we recommend you to install the dependencies from the `--requirements_file` when building your container image. In this case, you would reduce the pipeline startup time and do not need to pass `--requirements_file` option at runtime.

Review comment:
       If you are using a custom container image, we recommend that you install the dependencies from the `--requirements_file` directly into your image at build time. In this case, you do not need to pass the `--requirements_file` option at runtime, which will reduce the pipeline startup time. For example:...

##########
File path: website/www/site/content/en/documentation/runtime/environments.md
##########
@@ -171,6 +171,49 @@ creates a Java 8 SDK image with appropriate licenses in `/opt/apache/beam/third_
 
 By default, no licenses/notices are added to the docker images.
 
+#### Build an existing container image to make it compatible with Apache Beam Runners {#modify-existing-base-image}
+Beam offers a way to take a Beam container image and customize it. But if you have an existing base image to be compatible with Apache Beam Runners, use a [multi-stage build](https://docs.docker.com/develop/develop-images/multistage-build/) process to copy over the necessary artifacts from a default Apache Beam base image and provide your custom container image.

Review comment:
       ```suggestion
   Beam offers a way to take a Beam container image and customize it. But if you have an existing base image that you need to make compatible with Apache Beam Runners, use a [multi-stage build](https://docs.docker.com/develop/develop-images/multistage-build/) process to copy over the necessary artifacts from a default Apache Beam base image.
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Issue Time Tracking
-------------------

            Worklog Id:     (was: 736899)
    Remaining Estimate: 0h
            Time Spent: 10m

> Revise recommendations to manage Python pipeline dependencies. 
> ---------------------------------------------------------------
>
>                 Key: BEAM-13314
>                 URL: https://issues.apache.org/jira/browse/BEAM-13314
>             Project: Beam
>          Issue Type: Improvement
>          Components: sdk-py-core, website
>            Reporter: Valentyn Tymofieiev
>            Assignee: Anand Inguva
>            Priority: P2
>              Labels: usability
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> The page https://beam.apache.org/documentation/sdks/python-pipeline-dependencies/ recommends managing Python dependencies via requirements files.
> This approach is currently inefficient in light of the introduction and adoption of PEP-517 by some packages (see https://lists.apache.org/thread/trljnxo39c0cmff790yff3h8n5okqt3q and the rest of the thread), and the page does not mention Custom Containers or SDK prebuilding workflows.
>  
> We should revise it and document best practices.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)