Posted to issues@beam.apache.org by "Gerard Casas Saez (Jira)" <ji...@apache.org> on 2021/01/05 22:46:00 UTC

[jira] [Comment Edited] (BEAM-11275) Support GCS files for extra_requirements argument in Python Beam portable runners

    [ https://issues.apache.org/jira/browse/BEAM-11275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17259268#comment-17259268 ] 

Gerard Casas Saez edited comment on BEAM-11275 at 1/5/21, 10:45 PM:
--------------------------------------------------------------------

This is my stacktrace (I reported it on TFX): https://github.com/tensorflow/tfx/issues/2839

But the stacktrace is not really needed. If you look at https://github.com/apache/beam/blob/47ca61a40e8837d6687dcf29d4457ba6b24259a3/sdks/python/apache_beam/runners/portability/stager.py#L496, it checks whether the file is remote, but _download_file only works when the remote file is an https:// URL, as seen in https://github.com/apache/beam/blob/47ca61a40e8837d6687dcf29d4457ba6b24259a3/sdks/python/apache_beam/runners/portability/stager.py#L371. It does not handle the `gs://` prefix. The issue is that for any other remote file shutil.copyfile is used, which does not support GCS paths.
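
To make the problem concrete, here is a minimal sketch of the kind of dispatch that would avoid it. This is not the actual Stager code: `classify_path` and `fetch` are hypothetical names, and the download/copy routines are passed in as callables so the sketch does not depend on any particular SDK call. The point is only that a `gs://` path needs its own branch instead of falling through to shutil.copyfile.

```python
import re
import shutil

# Hypothetical helper, not part of Beam: matches a leading "scheme://".
_SCHEME = re.compile(r'^([a-zA-Z][a-zA-Z0-9+.-]*)://')


def classify_path(path):
    """Return 'http', 'filesystem', or 'local' for a dependency path."""
    match = _SCHEME.match(path)
    if match is None:
        return 'local'
    scheme = match.group(1).lower()
    if scheme in ('http', 'https'):
        return 'http'
    # Any other scheme (gs://, s3://, hdfs://, ...) is assumed to need a
    # filesystem-aware copy rather than shutil.copyfile.
    return 'filesystem'


def fetch(path, dest, http_download, filesystem_copy,
          local_copy=shutil.copyfile):
    """Dispatch to the matching copy routine and return the path kind.

    The callables are injected by the caller; which real functions to
    plug in here is left open in this sketch.
    """
    kind = classify_path(path)
    if kind == 'http':
        http_download(path, dest)
    elif kind == 'filesystem':
        filesystem_copy(path, dest)
    else:
        local_copy(path, dest)
    return kind
```

With something like this, `gs://bucket/pkg.tar.gz` would be routed to the filesystem-aware copy, while plain local paths would still go through shutil.copyfile.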


> Support GCS files for extra_requirements argument in Python Beam portable runners
> ---------------------------------------------------------------------------------
>
>                 Key: BEAM-11275
>                 URL: https://issues.apache.org/jira/browse/BEAM-11275
>             Project: Beam
>          Issue Type: Improvement
>          Components: runner-universal, sdk-py-core
>            Reporter: Gerard Casas Saez
>            Priority: P2
>              Labels: starter
>
> Currently, portable runners only support locally available files for adding dependencies on remote workers. This can be seen in https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/portability/stager.py#L429, which uses shutil.copyfile when it detects that the file is remote but not http.
> An easy extension would be to extend _is_remote_path in Stager to detect whether the path matches any filesystem and, if it does, then avoid downloading and let it be copied afterwards.
> Acceptance criteria:
> - `extra_package` can be a GCS path instead of requiring it to be local only.


