Posted to issues@beam.apache.org by "ASF GitHub Bot (Jira)" <ji...@apache.org> on 2021/09/04 01:29:00 UTC

[jira] [Work logged] (BEAM-11275) Support GCS files for extra_requirements argument in Python Beam portable runners

     [ https://issues.apache.org/jira/browse/BEAM-11275?focusedWorklogId=646546&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-646546 ]

ASF GitHub Bot logged work on BEAM-11275:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 04/Sep/21 01:28
            Start Date: 04/Sep/21 01:28
    Worklog Time Spent: 10m 
      Work Description: aaltay commented on pull request #15105:
URL: https://github.com/apache/beam/pull/15105#issuecomment-912884587


   I do not believe we should make this change, and this is not related to Google's private packages. The reasons are:
   - Staging packages creates a consistent set of artifacts throughout the lifetime of a pipeline. (Consider a long-running streaming job with autoscaling, where different workers might be started days apart.) Downloading from a location other than the staging location could result in different dependencies on different workers (e.g. if a package gets an update on PyPI).
   - If users want to do this, they can use a custom container. Beam's custom container protocol (https://s.apache.org/beam-fn-api-container-contract) allows for a completely custom container with support for changes like this (https://beam.apache.org/documentation/runtime/environments/).
   - It is also possible to achieve the same result without custom containers (this option works for legacy Dataflow pipelines as well) by using a setup.py file with custom commands (https://beam.apache.org/documentation/sdks/python-pipeline-dependencies/#nonpython). This file can execute any code at worker startup time, including downloading and installing arbitrary dependencies.
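
As a rough illustration of the last option, the sketch below shows the kind of helper a setup.py could call from a custom setuptools command to run arbitrary commands at worker startup. It is a minimal sketch, not the example from the linked documentation: the names `CUSTOM_COMMANDS` and `run_custom_commands` are hypothetical, and the single placeholder command only prints a message where a real pipeline might download and install a package.

```python
import subprocess
import sys

# Hypothetical command list. A real setup.py might instead download a package
# from remote storage and install it with pip.
CUSTOM_COMMANDS = [
    [sys.executable, "-c", "print('custom command ran')"],
]


def run_custom_commands(commands):
    """Run each command in order and return the captured output of each.

    Raises RuntimeError on the first command that exits with a non-zero code,
    so that a failed dependency installation is not silently ignored.
    """
    outputs = []
    for command in commands:
        result = subprocess.run(
            command,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            text=True,
        )
        if result.returncode != 0:
            raise RuntimeError(
                f"Command {command} failed with exit code "
                f"{result.returncode}: {result.stdout}"
            )
        outputs.append(result.stdout)
    return outputs
```

In an actual setup.py, a function like this would be invoked from the `run()` method of a custom command class registered with setuptools, so that it executes when the package is installed on each worker.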


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Issue Time Tracking
-------------------

    Worklog Id:     (was: 646546)
    Time Spent: 8h 40m  (was: 8.5h)

> Support GCS files for extra_requirements argument in Python Beam portable runners
> ---------------------------------------------------------------------------------
>
>                 Key: BEAM-11275
>                 URL: https://issues.apache.org/jira/browse/BEAM-11275
>             Project: Beam
>          Issue Type: Improvement
>          Components: sdk-py-core
>            Reporter: Gerard Casas Saez
>            Assignee: Calvin Leung
>            Priority: P2
>          Time Spent: 8h 40m
>  Remaining Estimate: 0h
>
> Currently, portable runners only support locally available files for adding dependencies on remote workers. This can be seen in https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/portability/stager.py#L429, which uses shutil.copyfile when it detects that the file is remote and not http.
> An easy extension would be to extend _is_remote_path in Stager to detect whether the path matches any filesystem and, if it does, avoid downloading and let it be copied afterwards.
> Acceptance criteria:
> - `extra_package` can be a GCS path instead of requiring it to be local only.
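
The extension proposed in the description above can be sketched roughly as follows. This is a standalone illustration, not the Stager code: the function names `is_remote_path` and `classify_dependency` and the returned labels are hypothetical, and the sketch assumes that a remote path is any path containing a `://` scheme separator.

```python
HTTP_SCHEMES = ("http://", "https://")


def is_remote_path(path):
    """Assumption: a path is remote if it contains a scheme separator."""
    return "://" in path


def classify_dependency(path):
    """Decide how a dependency path would be handled during staging.

    Returns one of three hypothetical labels:
      'download' - fetch over HTTP(S) into the local staging directory
      'defer'    - remote filesystem path (e.g. gs://); skip the local
                   download and let a filesystem-aware copy handle it later
      'copy'     - plain local path; copy it directly
    """
    if path.startswith(HTTP_SCHEMES):
        return "download"
    if is_remote_path(path):
        return "defer"
    return "copy"
```

Under these assumptions, `classify_dependency("gs://bucket/pkg.tar.gz")` would return `'defer'` rather than falling through to a local file copy, which is the case the issue reports as failing today.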



--
This message was sent by Atlassian Jira
(v8.3.4#803005)