Posted to github@beam.apache.org by GitBox <gi...@apache.org> on 2021/08/13 23:35:31 UTC

[GitHub] [beam] ihji commented on pull request #15105: [BEAM-11275] Defer remote package download in stager and GetArtifact from GCS

ihji commented on pull request #15105:
URL: https://github.com/apache/beam/pull/15105#issuecomment-898768860


   Is there any reason you want to download remote packages from the SDK harness? From the perspective of the Dataflow runner, I personally don't see much benefit: downloading from third-party services (Azure, AWS, PyPI, Maven, etc.) every time the SDK harness boots up seems vulnerable to network instability and third-party service failures. A Dataflow job would fail whenever a third-party service is unavailable, even if all GCP services are green.
   
   The SDK harness won't download anything itself based on `extra_packages.txt`. All package files listed in `extra_packages.txt` must already exist in the Docker container's staging location when pip installs them. To implement deferred remote package download, you need to:
   1) create URL artifact information for remote artifacts and add only local file names (matching the `staging_to` names from the URL artifact information) to `extra_packages.txt`;
   2) make sure that the URL artifact information is passed through to the SDK harness without being materialized during job submission;
   3) extend `materialize.go` to support non-GCS URL artifact information.
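
As a rough illustration of step 1 only, here is a minimal Python sketch. It is not Beam's stager code: `UrlArtifact`, `is_remote`, `plan_extra_packages`, the list of remote schemes, and the example URL and paths are all hypothetical names and values chosen for this sketch, and it assumes POSIX-style local paths. It shows the idea that a remote package becomes a URL artifact record, while `extra_packages.txt` lists only the file name that the package will have in the staging location.

```python
import posixpath
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass(frozen=True)
class UrlArtifact:
    """Hypothetical stand-in for URL artifact information (not a Beam type)."""
    url: str          # where the package can be fetched later
    staging_to: str   # file name expected in the staging location


def is_remote(package: str) -> bool:
    # Assumption: these schemes count as "remote"; the real set may differ.
    return urlparse(package).scheme in ("http", "https", "gs", "s3")


def plan_extra_packages(packages):
    """Split packages into URL artifacts and extra_packages.txt contents.

    Remote packages are recorded as URL artifacts instead of being
    downloaded here. Every package, local or remote, contributes only its
    file name to extra_packages.txt, matching the staging_to name.
    """
    url_artifacts = []
    local_names = []
    for package in packages:
        name = posixpath.basename(urlparse(package).path)
        if is_remote(package):
            url_artifacts.append(UrlArtifact(url=package, staging_to=name))
        local_names.append(name)
    extra_packages_txt = "".join(name + "\n" for name in local_names)
    return url_artifacts, extra_packages_txt


if __name__ == "__main__":
    artifacts, txt = plan_extra_packages([
        "https://example.com/pkgs/remote_pkg-1.0.tar.gz",  # made-up URL
        "/tmp/dist/local_pkg-0.1.tar.gz",                  # made-up path
    ])
    print(artifacts)
    print(txt, end="")
```

Steps 2 and 3 are not covered here: the URL artifact records would still have to reach the SDK harness unmaterialized, and `materialize.go` would have to learn how to fetch non-GCS URLs.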
   
   Also, please note that Dataflow uses some Google-internal Python SDK harness boot-up code, so this PR cannot be merged (at least for a few months) until Dataflow fully migrates to the public Python SDK harness container.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org