Posted to issues@beam.apache.org by "ASF GitHub Bot (Jira)" <ji...@apache.org> on 2021/10/06 08:48:00 UTC

[jira] [Work logged] (BEAM-12875) File systems are not registered when ArtifactRetrievalService is created by Spark runner

     [ https://issues.apache.org/jira/browse/BEAM-12875?focusedWorklogId=660779&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-660779 ]

ASF GitHub Bot logged work on BEAM-12875:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 06/Oct/21 08:47
            Start Date: 06/Oct/21 08:47
    Worklog Time Spent: 10m 
      Work Description: meowcakes commented on pull request #15502:
URL: https://github.com/apache/beam/pull/15502#issuecomment-935770232


   @pabloem Updated to address your comment.



Issue Time Tracking
-------------------

    Worklog Id:     (was: 660779)
    Time Spent: 1h  (was: 50m)

> File systems are not registered when ArtifactRetrievalService is created by Spark runner
> ----------------------------------------------------------------------------------------
>
>                 Key: BEAM-12875
>                 URL: https://issues.apache.org/jira/browse/BEAM-12875
>             Project: Beam
>          Issue Type: Improvement
>          Components: runner-spark
>    Affects Versions: 2.32.0
>            Reporter: Rogan Morrow
>            Priority: P2
>          Time Spent: 1h
>  Remaining Estimate: 0h
>
> I am new to this codebase, so apologies if I have misunderstood anything, but from what I can tell, when {{SparkExecutableStageFunction}} is called, an {{ArtifactRetrievalService}} is created (if the job bundle factory's environment cache is cold) to be called by the worker harness.
> The issue is that {{FileSystems.setDefaultPipelineOptions}} is not called before this, so no filesystems are registered. If cloud storage such as S3 is used to stage artifacts, the {{ArtifactRetrievalService}} cannot retrieve the artifacts and throws an exception:
>   {{java.lang.IllegalArgumentException: No filesystem found for scheme s3}}
> This doesn't affect other runners such as the Flink runner, because the Flink runner calls {{FileSystems.setDefaultPipelineOptions}} [in its executable stage function|https://github.com/apache/beam/blob/v2.32.0/runners/flink/src/main/java/org/apache/beam/runners/flink/translation/functions/FlinkExecutableStageFunction.java#L151].
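
The ordering problem described in the issue can be illustrated with a minimal, self-contained Java sketch. It is a toy model, not Beam's code: {{SchemeRegistrySketch}}, {{setDefaultOptions}}, {{retrieve}}, and {{reset}} are hypothetical names standing in for the registry behind {{FileSystems}}, and the assumption that only the {{file}} scheme is known before options are applied belongs to the model. It shows a lookup for an {{s3}} URI failing with the message quoted in the issue until the registration call has run.

```java
import java.net.URI;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Toy stand-in for a scheme-keyed filesystem registry. NOT Beam's actual classes. */
final class SchemeRegistrySketch {
    // Model assumption: only the local "file" scheme is known until options are applied.
    private static final Set<String> REGISTERED = ConcurrentHashMap.newKeySet();

    static {
        REGISTERED.add("file");
    }

    /** Hypothetical analogue of FileSystems.setDefaultPipelineOptions: registers extra schemes. */
    static void setDefaultOptions(Set<String> schemesFromOptions) {
        REGISTERED.addAll(schemesFromOptions);
    }

    /** Hypothetical analogue of artifact retrieval: resolves the URI's scheme or fails. */
    static String retrieve(String artifactUri) {
        String scheme = URI.create(artifactUri).getScheme();
        if (scheme == null || !REGISTERED.contains(scheme)) {
            throw new IllegalArgumentException("No filesystem found for scheme " + scheme);
        }
        return "resolved via " + scheme;
    }

    /** Restores the initial state so the sketch can be re-run. */
    static void reset() {
        REGISTERED.clear();
        REGISTERED.add("file");
    }

    public static void main(String[] args) {
        String artifact = "s3://bucket/staging/artifact.jar"; // hypothetical staged artifact

        // Retrieval before registration: mirrors the failure reported in the issue.
        try {
            retrieve(artifact);
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }

        // Registering first, as the Flink runner is described as doing, lets the lookup succeed.
        setDefaultOptions(Set.of("s3"));
        System.out.println(retrieve(artifact));
    }
}
```

In the real runner, the analogous change would presumably be to call {{FileSystems.setDefaultPipelineOptions}} before the {{ArtifactRetrievalService}} is created, as the Flink runner does. The exact placement in the Spark runner is not shown here.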


