You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@beam.apache.org by "Evgeniy (Jira)" <ji...@apache.org> on 2021/02/11 08:54:00 UTC

[jira] [Issue Comment Deleted] (BEAM-11275) Support GCS files for extra_requirements argument in Python Beam portable runners

     [ https://issues.apache.org/jira/browse/BEAM-11275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Evgeniy updated BEAM-11275:
---------------------------
    Comment: was deleted

(was: I have same issue with Beam and Flink (without GCS and Kubeflow). I use `FlinkRunner`

Pipeline options:
{code:java}
--runner=FlinkRunner,
--flink_master=*.*.*.*:8081
--environment_type=EXTERNAL
--environment_config=*.*.*.*:50000
--flink_version=1.10
{code}
 

 

Flink taskexecutor log:
{code:java}
Java.io.FileNotFoundException: /tmp/beam-temp3g1emmsd/artifactsbv63p2is/a9d47b4ae5a1f4a0b5fddd51d2cd152583637e1353a615d5457aa19494051826/1-ref_Environment_default_e-tfx_ephemeral-0.27.0.tar.gz (No such file or directory)
 at java.io.FileInputStream.open0(Native Method)
 at java.io.FileInputStream.open(FileInputStream.java:195)
 at java.io.FileInputStream.<init>(FileInputStream.java:138)
 at org.apache.beam.sdk.io.LocalFileSystem.open(LocalFileSystem.java:127)
 at org.apache.beam.sdk.io.LocalFileSystem.open(LocalFileSystem.java:83)
 at org.apache.beam.sdk.io.FileSystems.open(FileSystems.java:256)
 at org.apache.beam.runners.fnexecution.artifact.ArtifactRetrievalService.getArtifact(ArtifactRetrievalService.java:124)
 at org.apache.beam.runners.fnexecution.artifact.ArtifactRetrievalService.getArtifact(ArtifactRetrievalService.java:99)
 at org.apache.beam.model.jobmanagement.v1.ArtifactRetrievalServiceGrpc$MethodHandlers.invoke(ArtifactRetrievalServiceGrpc.java:327)
 at org.apache.beam.vendor.grpc.v1p26p0.io.grpc.stub.ServerCalls$UnaryServerCallHandler$UnaryServerCallListener.onHalfClose(ServerCalls.java:172)
 at org.apache.beam.vendor.grpc.v1p26p0.io.grpc.PartialForwardingServerCallListener.onHalfClose(PartialForwardingServerCallListener.java:35)
 at org.apache.beam.vendor.grpc.v1p26p0.io.grpc.ForwardingServerCallListener.onHalfClose(ForwardingServerCallListener.java:23)
 at org.apache.beam.vendor.grpc.v1p26p0.io.grpc.ForwardingServerCallListener$SimpleForwardingServerCallListener.onHalfClose(ForwardingServerCallListener.java:40)
 at org.apache.beam.vendor.grpc.v1p26p0.io.grpc.Contexts$ContextualizedServerCallListener.onHalfClose(Contexts.java:86)
 at org.apache.beam.vendor.grpc.v1p26p0.io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.halfClosed(ServerCallImpl.java:331)
 at org.apache.beam.vendor.grpc.v1p26p0.io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1HalfClosed.runInContext(ServerImpl.java:817)
 at org.apache.beam.vendor.grpc.v1p26p0.io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
 at org.apache.beam.vendor.grpc.v1p26p0.io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:123)
 at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
 at java.lang.Thread.run(Thread.java:745)
{code}
Beam worker pool log:
{code:java}
2021/02/10 12:28:10 Failed to retrieve staged files: failed to retrieve /tmp/staged in 3 attempts: failed to retrieve chunk for /tmp/staged/tfx_ephemeral-0.27.0.tar.gz
{code})

> Support GCS files for extra_requirements argument in Python Beam portable runners
> ---------------------------------------------------------------------------------
>
>                 Key: BEAM-11275
>                 URL: https://issues.apache.org/jira/browse/BEAM-11275
>             Project: Beam
>          Issue Type: Improvement
>          Components: sdk-py-core
>            Reporter: Gerard Casas Saez
>            Assignee: Calvin Leung
>            Priority: P2
>
> Currently Portable runners only support locally available files for adding dependencies on remote workers. This can be seen in https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/portability/stager.py#L429 as it uses shutil.copyfile when it detects file is remote and it is not http.
> An easy extension would be to extend _is_remote_path in Stager to detect if the path matches any filesystem and if it does the avoid downloading and let it be copied afterwards. 
> Acceptance criteria:
> - `extra_package` can be a GCS path instead of requiring it to be local only.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)