Posted to github@beam.apache.org by "tvalentyn (via GitHub)" <gi...@apache.org> on 2023/09/06 16:04:14 UTC

[GitHub] [beam] tvalentyn opened a new issue, #28331: [Feature Request]: Provide a user-facing api to stage and download large file dependencies onto Beam SDK workers

URL: https://github.com/apache/beam/issues/28331

   ### What would you like to happen?
   
   Users sometimes need to provision large files to SDK workers.
   
   The Beam artifact staging API is not directly exposed to Python SDK users beyond options that stage well-defined Python dependency artifacts, such as `--extra_package`; see: https://github.com/apache/beam/blob/7a4cbc18f97b4795eb00d4f14bc0790c564e5c9e/sdks/python/apache_beam/runners/portability/stager.py#L165 
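   
   For context, a minimal sketch of how the existing dependency-oriented options are passed today (the package path below is hypothetical):
   
   ```python
   import apache_beam as beam
   from apache_beam.options.pipeline_options import PipelineOptions
   
   # Stages a locally built Python package to the workers. The path is
   # hypothetical; note there is no analogous flag for arbitrary data files.
   options = PipelineOptions(
       flags=['--extra_package=./dist/my_package-0.0.1.tar.gz'])
   
   with beam.Pipeline(options=options) as p:
     _ = p | beam.Create([1, 2, 3]) | beam.Map(lambda x: x * 2)
   ```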
   
   Currently available options for staging large resources:
   * If you need to stage a large model to run predictions, consider the Beam RunInference API instead. The API already takes care of downloading the model and may improve over time; see the RunInference sketch after this list.
   * Include your data dependency in a custom container. This increases the container image size and slows worker startup: because Docker compresses images, not only does download time increase, but the image must also be decompressed during the pull. The Dataflow runner also currently needs additional flags to run large container images (increase the default `--disk_size_gb=...`, use `--experiments=disable_worker_container_image_prepull`).
   * Use a Python package that downloads the large file at installation time. See: https://beam.apache.org/documentation/sdks/python-pipeline-dependencies/#nonpython . A custom `gsutil cp` command can be added as in https://github.com/apache/beam/blob/99b2f7bd7939203138d4a5e18463339455fda461/sdks/python/apache_beam/examples/complete/juliaset/setup.py#L79 ; see the setup.py sketch after this list.
   * Use a custom container with a custom entrypoint that downloads the data dependency (e.g. via a `gsutil cp` command) before starting the Beam SDK workers: https://cloud.google.com/dataflow/docs/guides/using-custom-containers#custom-entrypoint.
   * Use [shared.py](https://github.com/apache/beam/blob/master/sdks/python/apache_beam/utils/shared.py) to download the dependency once per process, or [multi_process_shared.py](https://github.com/apache/beam/blob/master/sdks/python/apache_beam/utils/multi_process_shared.py) to download an artifact once per machine. Beam RunInference transforms use these utilities and fetch models via the FileSystems API, for example: https://github.com/apache/beam/blob/99b2f7bd7939203138d4a5e18463339455fda461/sdks/python/apache_beam/ml/inference/sklearn_inference.py#L59 ; see the shared.py sketch after this list.
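   
   To illustrate the RunInference option above, a minimal sketch with a scikit-learn model handler (the model URI is hypothetical):
   
   ```python
   import numpy
   
   import apache_beam as beam
   from apache_beam.ml.inference.base import RunInference
   from apache_beam.ml.inference.sklearn_inference import SklearnModelHandlerNumpy
   
   # The handler takes care of fetching the pickled model onto the worker.
   model_handler = SklearnModelHandlerNumpy(
       model_uri='gs://my-bucket/models/model.pkl')  # hypothetical path
   
   with beam.Pipeline() as p:
     _ = (p
          | beam.Create([numpy.array([1.0, 2.0])])
          | RunInference(model_handler))
   ```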
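   
   For the setup.py route, a trimmed sketch modeled on the juliaset example's custom-commands pattern (package name and bucket path are hypothetical); run the pipeline with `--setup_file=./setup.py`:
   
   ```python
   # setup.py
   import subprocess
   
   import setuptools
   from distutils.command.build import build as _build
   
   class build(_build):
     """Runs the custom commands as part of the package build."""
     sub_commands = _build.sub_commands + [('CustomCommands', None)]
   
   # Hypothetical: fetch a large data file at package-installation time.
   CUSTOM_COMMANDS = [
       ['gsutil', 'cp', 'gs://my-bucket/large-file.bin', '/tmp/large-file.bin'],
   ]
   
   class CustomCommands(setuptools.Command):
     """Runs each command in CUSTOM_COMMANDS on the worker."""
   
     def initialize_options(self):
       pass
   
     def finalize_options(self):
       pass
   
     def run(self):
       for command in CUSTOM_COMMANDS:
         subprocess.check_call(command)
   
   setuptools.setup(
       name='my_pipeline_package',  # hypothetical
       version='0.0.1',
       packages=setuptools.find_packages(),
       cmdclass={'build': build, 'CustomCommands': CustomCommands},
   )
   ```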
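   
   And for the shared.py route, a minimal sketch that downloads a file once per worker process via the FileSystems API (the path and the DoFn are hypothetical):
   
   ```python
   import apache_beam as beam
   from apache_beam.io.filesystems import FileSystems
   from apache_beam.utils import shared
   
   class _FileContents:
     """Wrapper class: shared.Shared keeps a weak reference, which plain
     bytes objects do not support."""
     def __init__(self, data):
       self.data = data
   
   def _download_large_file():
     # Hypothetical path; runs at most once per worker process.
     with FileSystems.open('gs://my-bucket/large-file.bin') as f:
       return _FileContents(f.read())
   
   class UseLargeFile(beam.DoFn):
     def __init__(self, shared_handle):
       self._shared_handle = shared_handle
   
     def setup(self):
       # Later bundles in the same process reuse the cached object.
       self._contents = self._shared_handle.acquire(_download_large_file)
   
     def process(self, element):
       yield element, len(self._contents.data)
   
   with beam.Pipeline() as p:
     handle = shared.Shared()  # create once, pass to the DoFn
     _ = (p
          | beam.Create(['a', 'b'])
          | beam.ParDo(UseLargeFile(handle)))
   ```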
   
   Some of these options are not straightforward, if not outright hacky, and some have usability or performance disadvantages. A user-facing API dedicated to staging data dependencies could fill this gap and provide more robust handling of large-file staging. The API could be consumed by Beam users directly, and by Beam transforms such as RunInference to declare and stage the data dependencies of a specific transform.
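   
   Purely as an illustration of the request (every name below is hypothetical; nothing like it exists in Beam today), such an API might take the shape of a dedicated pipeline option plus a worker-side lookup helper:
   
   ```python
   from apache_beam.options.pipeline_options import PipelineOptions
   
   # Hypothetical flag: the SDK would stage the file at submission time and
   # materialize it on each worker before processing starts.
   options = PipelineOptions(
       flags=['--data_dependency=gs://my-bucket/large-file.bin'])
   
   # Hypothetical worker-side helper resolving the staged file's local path:
   # local_path = beam.utils.staged_path('large-file.bin')
   ```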
   
   ### Issue Priority
   
   Priority: 2 (default / most feature requests should be filed as P2)
   
   ### Issue Components
   
   - [X] Component: Python SDK
   - [ ] Component: Java SDK
   - [ ] Component: Go SDK
   - [ ] Component: Typescript SDK
   - [ ] Component: IO connector
   - [ ] Component: Beam examples
   - [ ] Component: Beam playground
   - [ ] Component: Beam katas
   - [ ] Component: Website
   - [ ] Component: Spark Runner
   - [ ] Component: Flink Runner
   - [ ] Component: Samza Runner
   - [ ] Component: Twister2 Runner
   - [ ] Component: Hazelcast Jet Runner
   - [ ] Component: Google Cloud Dataflow Runner

