Posted to github@beam.apache.org by GitBox <gi...@apache.org> on 2021/09/14 11:44:46 UTC

[GitHub] [beam] calvinleungyk commented on pull request #15105: [BEAM-11275] Defer remote package download in stager and GetArtifact from GCS

calvinleungyk commented on pull request #15105:
URL: https://github.com/apache/beam/pull/15105#issuecomment-919072236


   For 1., I would argue that it's the user's responsibility to ensure the pipeline reads a consistent set of artifacts. If the user doesn't pin package versions in Python's requirements.txt, they will get inconsistent libraries downloaded across regular Python job invocations anyway. Even the GCS staging location here is overwritable, and nothing much is done to ensure the artifacts stay consistent across pipeline runs.
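   For context, "pinning" here just means freezing exact versions (the packages below are hypothetical), e.g. a requirements.txt of:
   
   ```text
   # every version pinned, so all job invocations resolve the same set
   numpy==1.21.2
   pandas==1.3.3
   ```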
   
   Custom containers could be a feasible solution, and we are currently looking into it. The downside is that users need to rebuild the container every time they change or add dependencies/files, which is flexible but does not provide the best user experience. For the `setup.py` approach, users would need to learn how to write and structure a `setup.py` file before they can start testing the pipeline, which increases overhead and introduces friction for rapid experimentation; a sketch of what they would have to write is shown below.
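   Even a minimal `setup.py` forces users to pick up setuptools concepts (the package name and dependency below are hypothetical):
   
   ```python
   # a minimal sketch of the setup.py users would otherwise need to maintain
   import setuptools
   
   setuptools.setup(
       name='my_pipeline_deps',              # hypothetical package name
       version='0.0.1',
       install_requires=['numpy==1.21.2'],   # dependencies declared here
       packages=setuptools.find_packages(),  # local modules laid out as packages
   )
   ```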
   
   For our use case, users compile a TFX pipeline (that uses Beam) on a local machine with `extra_packages` and then send it to a remote machine in a Kubeflow cluster. When the Kubeflow machine runs the pipeline, it has the pipeline but not the `extra_packages` files. Since `extra_packages` only supports local paths, the job launched on the remote Kubeflow machine fails. In the Dataflow runner's case, GCS buckets are already used as staging locations, so deferring GCS downloads to Dataflow workers doesn't seem like such a big change; see the sketch below for the usage this would enable.
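   Concretely, this is the usage we would like to enable (the bucket and file names are hypothetical; today only local paths work):
   
   ```python
   # a hedged sketch: pointing extra_packages at a GCS object instead of a
   # local file, so a remote Kubeflow machine can still resolve the artifact
   from apache_beam.options.pipeline_options import PipelineOptions, SetupOptions
   
   options = PipelineOptions()
   options.view_as(SetupOptions).extra_packages = [
       'gs://my-staging-bucket/my_extra_pkg-0.0.1.tar.gz',  # hypothetical path
   ]
   ```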
   
   Alternatively, if we don't want to defer GCS downloads to Dataflow workers, we can use an approach similar to what was originally proposed in https://issues.apache.org/jira/browse/BEAM-11275: support GCS paths in [stager.py](https://github.com/apache/beam/blob/92aebe4d8837b6c5a598acc489e14c72348acd8c/sdks/python/apache_beam/runners/portability/stager.py#L504) so that the remote runner can download the package from a GCS path (not just a local one) and upload it to the staging bucket. A rough sketch of what that download step could look like follows.
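   As a rough illustration (this is not the actual stager code, and `fetch_remote_package` is a hypothetical helper), the download step could reuse Beam's `FileSystems` abstraction, which already understands `gs://` paths:
   
   ```python
   # a hedged sketch of fetching a gs:// package so it can be re-staged;
   # UNCOMPRESSED keeps .tar.gz artifacts byte-for-byte intact
   from apache_beam.io.filesystem import CompressionTypes
   from apache_beam.io.filesystems import FileSystems
   
   def fetch_remote_package(gcs_path, local_path):
     """Copy an artifact from a GCS path to a local file before staging."""
     with FileSystems.open(
         gcs_path, compression_type=CompressionTypes.UNCOMPRESSED) as src:
       with open(local_path, 'wb') as dst:
         while True:
           chunk = src.read(1 << 20)  # stream in 1 MiB chunks
           if not chunk:
             break
           dst.write(chunk)
   ```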
   
   Compared to rebuilding containers multiple times or having our users learn how to write and structure a `setup.py` properly, this provides convenience that matches the existing user experience, so ideally we can merge support for GCS paths somewhere.
   
   Would be great if @aaltay and @ibzib could review this proposal. Thank you very much!

