
Posted to issues@beam.apache.org by "Kenneth Knowles (Jira)" <ji...@apache.org> on 2022/01/16 14:26:00 UTC

[jira] [Commented] (BEAM-10241) Dataflow template sharing temp directory in FileBasedSink which may cause a job deleting temp files generated by another job

    [ https://issues.apache.org/jira/browse/BEAM-10241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17476802#comment-17476802 ] 

Kenneth Knowles commented on BEAM-10241:
----------------------------------------

The expectation here is that {{getTempLocation}} gives each job a directory that is owned entirely by that job. The relative path under the temp location is assumed to be safe. So if two jobs created from the same template have the same full temp location, that is actually a bug in the Dataflow template launcher service, and not something that Beam would fix.

> Dataflow template sharing temp directory in FileBasedSink which may cause a job deleting temp files generated by another job
> ----------------------------------------------------------------------------------------------------------------------------
>
>                 Key: BEAM-10241
>                 URL: https://issues.apache.org/jira/browse/BEAM-10241
>             Project: Beam
>          Issue Type: Bug
>          Components: io-java-gcp
>            Reporter: Minbo Bae
>            Priority: P2
>
> The temp directory in FileBasedSink consists of output + ".temp-beam-" + [UUID.randomUUID()|https://github.com/apache/beam/blob/v2.14.0/sdks/java/core/src/main/java/org/apache/beam/sdk/io/FileBasedSink.java#L521].
> However, the ".temp-beam-" + UUID.randomUUID() part is fixed when the pipeline is uploaded as a Dataflow template, so all jobs created from that template use the same temp directory if their output directories are the same.
> This may cause a job to delete temp files generated by another job when concurrent template jobs write their outputs to the same directory (e.g. gs://df-job/output/job1.out and gs://df-job/output/job2.out).
> It looks like [BigQueryIO|https://github.com/apache/beam/blob/v2.22.0/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryIO.java#L1075-L1163] creates a BQ job ID at execution time in the case of a Dataflow template. Can we make a similar fix for FileBasedSink?
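
The difference between a suffix chosen at construction time and one chosen at execution time can be sketched in plain Java. This is only an illustration of the behaviour described in the issue, not Beam code: the class names {{ConstructionTimeSink}} and {{ExecutionTimeSink}} and the method {{tempDirForJob}} are hypothetical, and the path is built by simple string concatenation as the description states.

```java
import java.util.UUID;
import java.util.function.Supplier;

public class TempDirNaming {

    // Hypothetical stand-in for a sink whose temp directory is computed when
    // the pipeline (the "template") is constructed. Every job launched from
    // the same instance sees the same directory.
    static final class ConstructionTimeSink {
        private final String tempDir;

        ConstructionTimeSink(String output) {
            // The random suffix is frozen here, once, at construction time.
            this.tempDir = output + ".temp-beam-" + UUID.randomUUID();
        }

        String tempDirForJob() {
            return tempDir;
        }
    }

    // Hypothetical alternative: the suffix is deferred behind a Supplier and
    // only evaluated when a job asks for its temp directory.
    static final class ExecutionTimeSink {
        private final String output;
        private final Supplier<String> suffix;

        ExecutionTimeSink(String output, Supplier<String> suffix) {
            this.output = output;
            this.suffix = suffix;
        }

        String tempDirForJob() {
            // A fresh suffix is drawn on every call, i.e. per job.
            return output + ".temp-beam-" + suffix.get();
        }
    }

    public static void main(String[] args) {
        String output = "gs://df-job/output/";

        ConstructionTimeSink fixed = new ConstructionTimeSink(output);
        // Two "jobs" from the same template share one temp directory.
        System.out.println(fixed.tempDirForJob().equals(fixed.tempDirForJob()));

        ExecutionTimeSink deferred =
            new ExecutionTimeSink(output, () -> UUID.randomUUID().toString());
        // Two "jobs" from the same template get distinct temp directories.
        System.out.println(deferred.tempDirForJob().equals(deferred.tempDirForJob()));
    }
}
```

In this sketch the first comparison is true and the second is false, which mirrors the collision reported for concurrent template jobs and the kind of execution-time ID generation the reporter attributes to BigQueryIO.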



--
This message was sent by Atlassian Jira
(v8.20.1#820001)