Posted to github@beam.apache.org by GitBox <gi...@apache.org> on 2022/08/10 22:13:48 UTC

[GitHub] [beam] alxmrs opened a new issue, #22675: [Feature Request]: Extend Python's Setup Options to install dependencies from an Anaconda `environment.yml` file.

alxmrs opened a new issue, #22675:
URL: https://github.com/apache/beam/issues/22675

   ### What would you like to happen?
   
   I'd like to offer an extension to my feature request in #22349 (for `conda` support in the Python Beam SDK Docker image). Given that scientific software is most often distributed via the Anaconda package manager (typically from `conda-forge`), I propose that users of the Python Beam SDK be allowed to add Python dependencies via an `environment.yml` file. I'm imagining something like:
   ```
   python era5_climatology.py --conda_environment environment.yml --runner DataflowRunner  # etc...
   ```
   
   After the user specifies the environment, the remote Beam runner should set up a Docker image with the Anaconda package manager and install all of the dependencies listed in the `environment.yml` file into the global runtime environment. These packages should be usable from each step in the pipeline.
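
A minimal sketch of the install step described above, assuming the worker image already ships `conda` and that applying the file to the `base` environment is an acceptable stand-in for a "global" install. The helper name `conda_install_command` is hypothetical; `conda env update --name ... --file ...` is the existing conda subcommand for applying an environment file to an existing environment.

```python
from typing import List


def conda_install_command(env_file: str, env_name: str = "base") -> List[str]:
    """Build the command that applies an environment file to an existing env.

    Hypothetical helper: nothing like this exists in Beam today.
    """
    # Reject obviously wrong inputs before anything is shipped to a worker.
    if not env_file.endswith((".yml", ".yaml")):
        raise ValueError(f"expected a YAML environment file, got {env_file!r}")
    # `conda env update` installs the file's dependencies into `env_name`.
    return ["conda", "env", "update", "--name", env_name, "--file", env_file]


# The argument list could be handed to subprocess.run() on the worker.
print(" ".join(conda_install_command("environment.yml")))
```

Returning an argument list rather than a shell string avoids quoting problems if the file path contains spaces.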
   
   I anticipate that such a feature would be really valuable to members of the scientific Python community, who are more versed in Anaconda environments than in Docker. Indeed, this could drastically simplify setting up dependencies for Python users, saving them from compiling scientific packages in Docker or debugging `pip` and `setuptools`.
   
   There is a tradeoff, however: a prebuilt Docker image will offer faster start times than installing dependencies at runtime. I see this feature fitting alongside the `pip`, `tar`, and `setup.py` approaches that already exist for [managing Python dependencies](https://beam.apache.org/documentation/sdks/python-pipeline-dependencies/).
   
   This feature would depend on the existence of a standard Anaconda SDK image (see the connected issue above).
   
   ## Implementation Notes
   
   The entrypoint for this feature would start with an additional argument to the `SetupOptions`:
   
   https://github.com/apache/beam/blob/c9c57a765dbaae7960deef80c0471766b26636d6/sdks/python/apache_beam/options/pipeline_options.py#L1105
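
A minimal sketch of what the new argument could look like, written against plain `argparse` so that it runs without Beam installed. The flag name `--conda_environment` comes from the example invocation above; the helper `add_conda_argument` and the help text are assumptions, loosely modeled on how existing options such as `--requirements_file` are declared.

```python
import argparse


def add_conda_argument(parser: argparse.ArgumentParser) -> None:
    """Register the proposed flag (hypothetical; not part of Beam today)."""
    parser.add_argument(
        "--conda_environment",
        default=None,
        help="Path to an Anaconda environment.yml file whose dependencies "
        "should be installed on remote workers.",
    )


parser = argparse.ArgumentParser()
add_conda_argument(parser)

# parse_known_args leaves flags owned by other option groups untouched.
known, remaining = parser.parse_known_args(
    ["--conda_environment", "environment.yml", "--runner", "DataflowRunner"]
)
print(known.conda_environment)
print(remaining)
```

In Beam itself the `add_argument` call would presumably sit next to the other `SetupOptions` arguments rather than in a standalone function.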
   
   For the `DataflowRunner` specifically, we could check whether an `environment.yml` file was passed in and, if so, choose the `conda`-enabled Python container, following a pattern similar to the existing logic:
   
   https://github.com/apache/beam/blob/c9c57a765dbaae7960deef80c0471766b26636d6/sdks/python/apache_beam/runners/dataflow/dataflow_runner.py#L466
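
A rough sketch of the selection rule described above. It does not reproduce the linked Dataflow code; the function name, the precedence given to a user-supplied image, and both image names are placeholders chosen for illustration (no `conda`-enabled SDK image exists yet).

```python
from typing import Optional

# Placeholder image names, NOT published images.
DEFAULT_IMAGE = "example.invalid/beam_python_sdk:latest"
CONDA_IMAGE = "example.invalid/beam_python_conda_sdk:latest"


def select_container_image(
    conda_environment: Optional[str],
    user_image: Optional[str] = None,
) -> str:
    """Pick the worker image based on the (hypothetical) conda option."""
    # Assumption: an explicitly requested custom container always wins.
    if user_image:
        return user_image
    # An environment file was passed, so use the conda-enabled image.
    if conda_environment:
        return CONDA_IMAGE
    # Otherwise keep today's behavior.
    return DEFAULT_IMAGE


print(select_container_image("environment.yml"))
```

Whether a custom container should override the `conda` image, or the two options should be mutually exclusive, is a design choice this sketch leaves open.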
   
   To me, how we can add Anaconda support to other types of remote runners is somewhat of an open question.
   
   I received a tip from @yuvipanda, in a one-off discussion, that Mamba (https://mamba.readthedocs.io/en/latest/) might be a more useful tool to integrate with than `miniconda3` specifically. Mamba accepts the same files and interfaces as `conda`, but includes a faster implementation and dependency resolver.
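
Because the two tools share an interface, a worker could prefer Mamba when it is present and fall back to `conda` otherwise. A small sketch under that assumption; the function name `pick_conda_executable` is hypothetical, and the lookup is injectable so the choice can be exercised without either tool installed.

```python
import shutil
from typing import Callable, Optional


def pick_conda_executable(
    which: Callable[[str], Optional[str]] = shutil.which,
) -> str:
    """Return "mamba" if it is on PATH, otherwise fall back to "conda"."""
    return "mamba" if which("mamba") else "conda"


# Simulate a worker image that only ships conda.
def only_conda(name: str) -> Optional[str]:
    return "/opt/conda/bin/conda" if name == "conda" else None


print(pick_conda_executable(only_conda))
```

Any corner cases where the two tools resolve an environment differently would still need to be checked against the chosen image.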
   
   ### Issue Priority
   
   Priority: 3
   
   ### Issue Component
   
   Component: sdk-py-core


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [beam] martindurant commented on issue #22675: [Feature Request]: Extend Python's Setup Options to install dependencies from an Anaconda `environment.yml` file.

Posted by GitBox <gi...@apache.org>.
martindurant commented on issue #22675:
URL: https://github.com/apache/beam/issues/22675#issuecomment-1212337849

   On mamba vs. conda: they should be equivalent, but there have been some corner cases. However, conda itself now optionally supports "libmamba" as a solver, so the two are slowly merging back together again.
   
   Conda works happily alongside pip, and can run in its own Docker image or be installed into another image as a layer over something else.




[GitHub] [beam] alxmrs commented on issue #22675: [Feature Request]: Extend Python's Setup Options to install dependencies from an Anaconda `environment.yml` file.

Posted by GitBox <gi...@apache.org>.
alxmrs commented on issue #22675:
URL: https://github.com/apache/beam/issues/22675#issuecomment-1211340881

   CC: @martindurant

