You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@beam.apache.org by Claire McGinty <cl...@gmail.com> on 2021/09/09 18:36:47 UTC

[Proposal] Default tempDirectory setting in file-based IOs

Hi Beam devs,

I have a question/proposal about the default tempDirectory setting for
file-based IOs. AvroIO, FileIO, TextIO all provide Builders with an
optional tempDirectory setter, and when the transforms are expanded,
tempDirectory will default to the value of the final output directory if
null [AvroIO
<https://github.com/apache/beam/blob/1d4a9ccd11c14ac6f0a2de1cc438a881244ede0a/sdks/java/core/src/main/java/org/apache/beam/sdk/io/AvroIO.java#L1634-L1637>
/FileIO
<https://github.com/apache/beam/blob/f759a5c7fe34c3d9e39cc21bb78cdc5da0a13eb1/sdks/java/core/src/main/java/org/apache/beam/sdk/io/FileIO.java#L1284-L1290>
/TextIO
<https://github.com/apache/beam/blob/243128a8fc52798e1b58b0cf1a271d95ee7aa241/sdks/java/core/src/main/java/org/apache/beam/sdk/io/TextIO.java#L992-L995>
].

I think it would make sense to default to the value of
PipelineOptions#getTempLocation instead, which is accessible inside
the expand(PCollection<T>
input) method; it seems reasonable for the user to expect that their
PipelineOptions#getTempLocation will be honored, and additionally, their
final output locations may have locks/retention policies set that make the
temp file renaming step fail. Plus, this pattern looks like it's already
being used in BigQueryIO
<https://github.com/apache/beam/blob/e76b4db30a90d8f351e807cb247a707e7a3c566c/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryIO.java#L944>
.

What do you think?

Thanks!
Claire

Re: [Proposal] Default tempDirectory setting in file-based IOs

Posted by Reza Rokni <re...@google.com>.
+1 on the proposal and double +1 for having consistency across I/O's!

On Thu, Sep 9, 2021 at 8:16 PM Ahmet Altay <al...@google.com> wrote:

> Adding relevant folks +Chamikara Jayalath <ch...@google.com> +Pablo
> Estrada <pa...@google.com>
>
> This proposal makes sense to me. It makes it easier for users to reason
> about why a temp directory is chosen, and would lead to a unified code
> across all IOs that does this.
>
> On Thu, Sep 9, 2021 at 11:37 AM Claire McGinty <cl...@gmail.com>
> wrote:
>
>> Hi Beam devs,
>>
>> I have a question/proposal about the default tempDirectory setting for
>> file-based IOs. AvroIO, FileIO, TextIO all provide Builders with an
>> optional tempDirectory setter, and when the transforms are expanded,
>> tempDirectory will default to the value of the final output directory if
>> null [AvroIO
>> <https://github.com/apache/beam/blob/1d4a9ccd11c14ac6f0a2de1cc438a881244ede0a/sdks/java/core/src/main/java/org/apache/beam/sdk/io/AvroIO.java#L1634-L1637>
>> /FileIO
>> <https://github.com/apache/beam/blob/f759a5c7fe34c3d9e39cc21bb78cdc5da0a13eb1/sdks/java/core/src/main/java/org/apache/beam/sdk/io/FileIO.java#L1284-L1290>
>> /TextIO
>> <https://github.com/apache/beam/blob/243128a8fc52798e1b58b0cf1a271d95ee7aa241/sdks/java/core/src/main/java/org/apache/beam/sdk/io/TextIO.java#L992-L995>
>> ].
>>
>> I think it would make sense to default to the value of
>> PipelineOptions#getTempLocation instead, which is accessible inside the expand(PCollection<T>
>> input) method; it seems reasonable for the user to expect that their
>> PipelineOptions#getTempLocation will be honored, and additionally, their
>> final output locations may have locks/retention policies set that make the
>> temp file renaming step fail. Plus, this pattern looks like it's already
>> being used in BigQueryIO
>> <https://github.com/apache/beam/blob/e76b4db30a90d8f351e807cb247a707e7a3c566c/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryIO.java#L944>
>> .
>>
>> What do you think?
>>
>> Thanks!
>> Claire
>>
>>
>>

Re: [Proposal] Default tempDirectory setting in file-based IOs

Posted by Robert Bradshaw <ro...@google.com>.
I'd happily review such a PR.

On Thu, Sep 23, 2021, 7:44 AM Claire McGinty <cl...@gmail.com>
wrote:

> Thanks Cham and Robert! I think having a .withPipelineTempDirectory
> builder option in the file-based sink Builders would be helpful for our
> use case. I'm happy to put up a PR if there's consensus on this.
>
> Thanks,
> Claire
>
> On Fri, Sep 10, 2021 at 1:20 PM Chamikara Jayalath <ch...@google.com>
> wrote:
>
>>
>>
>> On Fri, Sep 10, 2021 at 9:24 AM Robert Bradshaw <ro...@google.com>
>> wrote:
>>
>>> Being able to (cheaply) rename files is precisely why the temporary
>>> files are colocated with the final output destination. There are other
>>> benefits as well like automatically inheriting the right permissions,
>>> not having to worry about having sufficient disk/quota etc. in both
>>> the temporary and final place, etc. I don't think it makes sense to
>>> change this default. (I don't think the analogy with BigqueryIO holds,
>>> as there we have no "local" place to collate with.
>>>
>>> We could introduce a new withPipelineTempDirectory() method that would
>>> invoke withTempDirectory with the PipelineOptions.getTempLocation to
>>> make this easier.
>>>
>>
>> As others pointed out, getTempLocation can be a completely different
>> bucket or maybe even a different region for some users. So changing the
>> default could result in performance regressions for some users.
>>
>> BigQueryIO sink (in export mode)  is a bit different since we don't
>> perform a renaming of files to a final location. We just write temp files
>> and load them to BigQuery using load jobs.
>>
>> Also note that we create a uniquely named subdirectory within the temp
>> directory for each run of a pipeline. So there should not be conflicts
>> between different runs when using the same output directory:
>> https://github.com/apache/beam/blob/95c3811312ad41f98dd5ef0674ef30ce30448746/sdks/java/core/src/main/java/org/apache/beam/sdk/io/FileBasedSink.java#L525
>>
>> Temp output directory can be overridden today using the
>> "withTempDirectory" option. I'm also OK with introducing a pipeline option
>> to make this easier if needed.
>>
>> Thanks,
>> Cham
>>
>>
>>>
>>> On Fri, Sep 10, 2021 at 8:12 AM Claire McGinty
>>> <cl...@gmail.com> wrote:
>>> >
>>> > Niel, that's a good point -- I don't think there's any restriction on
>>> the filesystem of PipelineOptions#tempLocation, I was able to run a job on
>>> DirectRunner with PipelineOptions#tempLocation set to a local path & my
>>> TextIO.Write#outputDirectory set to a remote filesystem.
>>> >
>>> > But currently there's also no check that would catch
>>> AvroIO.write(...).to(<filesystem-1-path>).withTempDirectory(<filesystem-2-path>)
>>> at graph construction time, either (it will fail at runtime instead). I
>>> think the FileSystems api has some logic for performing this check that we
>>> could extract into a utility method and use in PTransform#expand?
>>> >
>>> > - Claire
>>> >
>>> > On Fri, Sep 10, 2021 at 4:11 AM Niel Markwick <ni...@google.com>
>>> wrote:
>>> >>
>>> >> I have heard of an intermittent fault in avroIO where two independent
>>> pipelines using the same output directory deleted each others temp files
>>> while they were being written.
>>> >>
>>> >> I cannot reproduce the problem, nor have I found a code path that
>>> could cause it in fileio (with a superficial look), but it has been
>>> reported more than once...
>>> >>
>>> >>
>>> >>
>>> >> Back to the original question, is it possible that the tempDirectory
>>> from PipelineOptions points to a different filesystem than used for the
>>> fileio output? Because this could break transforms that write to a tempfile
>>> in the destination filesystem then rename the tempfile to the final output
>>> filename when writing is complete.
>>> >>
>>> >> On Fri, 10 Sep 2021, 07:31 Reuven Lax, <re...@google.com> wrote:
>>> >>>
>>> >>> While this makes sense, I can also see a risk of breaking existing
>>> users by changing the default like that. In addition it might be a
>>> less-performant default. A file rename that takes place within the same
>>> location might be performed via a fast rename operation, where otherwise it
>>> would be forced to generate an expensive copy + delete. This could cause
>>> strange and hard-to-debug performance problems when people upgrade to the
>>> newer Beam version.
>>> >>>
>>> >>> On Thu, Sep 9, 2021 at 8:15 PM Ahmet Altay <al...@google.com> wrote:
>>> >>>>
>>> >>>> Adding relevant folks +Chamikara Jayalath +Pablo Estrada
>>> >>>>
>>> >>>> This proposal makes sense to me. It makes it easier for users to
>>> reason about why a temp directory is chosen, and would lead to a unified
>>> code across all IOs that does this.
>>> >>>>
>>> >>>> On Thu, Sep 9, 2021 at 11:37 AM Claire McGinty <
>>> claire.d.mcginty@gmail.com> wrote:
>>> >>>>>
>>> >>>>> Hi Beam devs,
>>> >>>>>
>>> >>>>> I have a question/proposal about the default tempDirectory setting
>>> for file-based IOs. AvroIO, FileIO, TextIO all provide Builders with an
>>> optional tempDirectory setter, and when the transforms are expanded,
>>> tempDirectory will default to the value of the final output directory if
>>> null [AvroIO/FileIO/TextIO].
>>> >>>>>
>>> >>>>> I think it would make sense to default to the value of
>>> PipelineOptions#getTempLocation instead, which is accessible inside the
>>> expand(PCollection<T> input) method; it seems reasonable for the user to
>>> expect that their PipelineOptions#getTempLocation will be honored, and
>>> additionally, their final output locations may have locks/retention
>>> policies set that make the temp file renaming step fail. Plus, this pattern
>>> looks like it's already being used in BigQueryIO.
>>> >>>>>
>>> >>>>> What do you think?
>>> >>>>>
>>> >>>>> Thanks!
>>> >>>>> Claire
>>> >>>>>
>>> >>>>>
>>>
>>

Re: [Proposal] Default tempDirectory setting in file-based IOs

Posted by Claire McGinty <cl...@gmail.com>.
Thanks Cham and Robert! I think having a .withPipelineTempDirectory builder
option in the file-based sink Builders would be helpful for our use case.
I'm happy to put up a PR if there's consensus on this.

Thanks,
Claire

On Fri, Sep 10, 2021 at 1:20 PM Chamikara Jayalath <ch...@google.com>
wrote:

>
>
> On Fri, Sep 10, 2021 at 9:24 AM Robert Bradshaw <ro...@google.com>
> wrote:
>
>> Being able to (cheaply) rename files is precisely why the temporary
>> files are colocated with the final output destination. There are other
>> benefits as well like automatically inheriting the right permissions,
>> not having to worry about having sufficient disk/quota etc. in both
>> the temporary and final place, etc. I don't think it makes sense to
>> change this default. (I don't think the analogy with BigqueryIO holds,
>> as there we have no "local" place to collate with.
>>
>> We could introduce a new withPipelineTempDirectory() method that would
>> invoke withTempDirectory with the PipelineOptions.getTempLocation to
>> make this easier.
>>
>
> As others pointed out, getTempLocation can be a completely different
> bucket or maybe even a different region for some users. So changing the
> default could result in performance regressions for some users.
>
> BigQueryIO sink (in export mode)  is a bit different since we don't
> perform a renaming of files to a final location. We just write temp files
> and load them to BigQuery using load jobs.
>
> Also note that we create a uniquely named subdirectory within the temp
> directory for each run of a pipeline. So there should not be conflicts
> between different runs when using the same output directory:
> https://github.com/apache/beam/blob/95c3811312ad41f98dd5ef0674ef30ce30448746/sdks/java/core/src/main/java/org/apache/beam/sdk/io/FileBasedSink.java#L525
>
> Temp output directory can be overridden today using the
> "withTempDirectory" option. I'm also OK with introducing a pipeline option
> to make this easier if needed.
>
> Thanks,
> Cham
>
>
>>
>> On Fri, Sep 10, 2021 at 8:12 AM Claire McGinty
>> <cl...@gmail.com> wrote:
>> >
>> > Niel, that's a good point -- I don't think there's any restriction on
>> the filesystem of PipelineOptions#tempLocation, I was able to run a job on
>> DirectRunner with PipelineOptions#tempLocation set to a local path & my
>> TextIO.Write#outputDirectory set to a remote filesystem.
>> >
>> > But currently there's also no check that would catch
>> AvroIO.write(...).to(<filesystem-1-path>).withTempDirectory(<filesystem-2-path>)
>> at graph construction time, either (it will fail at runtime instead). I
>> think the FileSystems api has some logic for performing this check that we
>> could extract into a utility method and use in PTransform#expand?
>> >
>> > - Claire
>> >
>> > On Fri, Sep 10, 2021 at 4:11 AM Niel Markwick <ni...@google.com> wrote:
>> >>
>> >> I have heard of an intermittent fault in avroIO where two independent
>> pipelines using the same output directory deleted each others temp files
>> while they were being written.
>> >>
>> >> I cannot reproduce the problem, nor have I found a code path that
>> could cause it in fileio (with a superficial look), but it has been
>> reported more than once...
>> >>
>> >>
>> >>
>> >> Back to the original question, is it possible that the tempDirectory
>> from PipelineOptions points to a different filesystem than used for the
>> fileio output? Because this could break transforms that write to a tempfile
>> in the destination filesystem then rename the tempfile to the final output
>> filename when writing is complete.
>> >>
>> >> On Fri, 10 Sep 2021, 07:31 Reuven Lax, <re...@google.com> wrote:
>> >>>
>> >>> While this makes sense, I can also see a risk of breaking existing
>> users by changing the default like that. In addition it might be a
>> less-performant default. A file rename that takes place within the same
>> location might be performed via a fast rename operation, where otherwise it
>> would be forced to generate an expensive copy + delete. This could cause
>> strange and hard-to-debug performance problems when people upgrade to the
>> newer Beam version.
>> >>>
>> >>> On Thu, Sep 9, 2021 at 8:15 PM Ahmet Altay <al...@google.com> wrote:
>> >>>>
>> >>>> Adding relevant folks +Chamikara Jayalath +Pablo Estrada
>> >>>>
>> >>>> This proposal makes sense to me. It makes it easier for users to
>> reason about why a temp directory is chosen, and would lead to a unified
>> code across all IOs that does this.
>> >>>>
>> >>>> On Thu, Sep 9, 2021 at 11:37 AM Claire McGinty <
>> claire.d.mcginty@gmail.com> wrote:
>> >>>>>
>> >>>>> Hi Beam devs,
>> >>>>>
>> >>>>> I have a question/proposal about the default tempDirectory setting
>> for file-based IOs. AvroIO, FileIO, TextIO all provide Builders with an
>> optional tempDirectory setter, and when the transforms are expanded,
>> tempDirectory will default to the value of the final output directory if
>> null [AvroIO/FileIO/TextIO].
>> >>>>>
>> >>>>> I think it would make sense to default to the value of
>> PipelineOptions#getTempLocation instead, which is accessible inside the
>> expand(PCollection<T> input) method; it seems reasonable for the user to
>> expect that their PipelineOptions#getTempLocation will be honored, and
>> additionally, their final output locations may have locks/retention
>> policies set that make the temp file renaming step fail. Plus, this pattern
>> looks like it's already being used in BigQueryIO.
>> >>>>>
>> >>>>> What do you think?
>> >>>>>
>> >>>>> Thanks!
>> >>>>> Claire
>> >>>>>
>> >>>>>
>>
>

Re: [Proposal] Default tempDirectory setting in file-based IOs

Posted by Chamikara Jayalath <ch...@google.com>.
On Fri, Sep 10, 2021 at 9:24 AM Robert Bradshaw <ro...@google.com> wrote:

> Being able to (cheaply) rename files is precisely why the temporary
> files are colocated with the final output destination. There are other
> benefits as well like automatically inheriting the right permissions,
> not having to worry about having sufficient disk/quota etc. in both
> the temporary and final place, etc. I don't think it makes sense to
> change this default. (I don't think the analogy with BigqueryIO holds,
> as there we have no "local" place to collate with.
>
> We could introduce a new withPipelineTempDirectory() method that would
> invoke withTempDirectory with the PipelineOptions.getTempLocation to
> make this easier.
>

As others pointed out, getTempLocation can be a completely different bucket
or maybe even a different region for some users. So changing the default
could result in performance regressions for some users.

BigQueryIO sink (in export mode)  is a bit different since we don't perform
a renaming of files to a final location. We just write temp files and load
them to BigQuery using load jobs.

Also note that we create a uniquely named subdirectory within the temp
directory for each run of a pipeline. So there should not be conflicts
between different runs when using the same output directory:
https://github.com/apache/beam/blob/95c3811312ad41f98dd5ef0674ef30ce30448746/sdks/java/core/src/main/java/org/apache/beam/sdk/io/FileBasedSink.java#L525

Temp output directory can be overridden today using the "withTempDirectory"
option. I'm also OK with introducing a pipeline option to make this easier
if needed.

Thanks,
Cham


>
> On Fri, Sep 10, 2021 at 8:12 AM Claire McGinty
> <cl...@gmail.com> wrote:
> >
> > Niel, that's a good point -- I don't think there's any restriction on
> the filesystem of PipelineOptions#tempLocation, I was able to run a job on
> DirectRunner with PipelineOptions#tempLocation set to a local path & my
> TextIO.Write#outputDirectory set to a remote filesystem.
> >
> > But currently there's also no check that would catch
> AvroIO.write(...).to(<filesystem-1-path>).withTempDirectory(<filesystem-2-path>)
> at graph construction time, either (it will fail at runtime instead). I
> think the FileSystems api has some logic for performing this check that we
> could extract into a utility method and use in PTransform#expand?
> >
> > - Claire
> >
> > On Fri, Sep 10, 2021 at 4:11 AM Niel Markwick <ni...@google.com> wrote:
> >>
> >> I have heard of an intermittent fault in avroIO where two independent
> pipelines using the same output directory deleted each others temp files
> while they were being written.
> >>
> >> I cannot reproduce the problem, nor have I found a code path that could
> cause it in fileio (with a superficial look), but it has been reported more
> than once...
> >>
> >>
> >>
> >> Back to the original question, is it possible that the tempDirectory
> from PipelineOptions points to a different filesystem than used for the
> fileio output? Because this could break transforms that write to a tempfile
> in the destination filesystem then rename the tempfile to the final output
> filename when writing is complete.
> >>
> >> On Fri, 10 Sep 2021, 07:31 Reuven Lax, <re...@google.com> wrote:
> >>>
> >>> While this makes sense, I can also see a risk of breaking existing
> users by changing the default like that. In addition it might be a
> less-performant default. A file rename that takes place within the same
> location might be performed via a fast rename operation, where otherwise it
> would be forced to generate an expensive copy + delete. This could cause
> strange and hard-to-debug performance problems when people upgrade to the
> newer Beam version.
> >>>
> >>> On Thu, Sep 9, 2021 at 8:15 PM Ahmet Altay <al...@google.com> wrote:
> >>>>
> >>>> Adding relevant folks +Chamikara Jayalath +Pablo Estrada
> >>>>
> >>>> This proposal makes sense to me. It makes it easier for users to
> reason about why a temp directory is chosen, and would lead to a unified
> code across all IOs that does this.
> >>>>
> >>>> On Thu, Sep 9, 2021 at 11:37 AM Claire McGinty <
> claire.d.mcginty@gmail.com> wrote:
> >>>>>
> >>>>> Hi Beam devs,
> >>>>>
> >>>>> I have a question/proposal about the default tempDirectory setting
> for file-based IOs. AvroIO, FileIO, TextIO all provide Builders with an
> optional tempDirectory setter, and when the transforms are expanded,
> tempDirectory will default to the value of the final output directory if
> null [AvroIO/FileIO/TextIO].
> >>>>>
> >>>>> I think it would make sense to default to the value of
> PipelineOptions#getTempLocation instead, which is accessible inside the
> expand(PCollection<T> input) method; it seems reasonable for the user to
> expect that their PipelineOptions#getTempLocation will be honored, and
> additionally, their final output locations may have locks/retention
> policies set that make the temp file renaming step fail. Plus, this pattern
> looks like it's already being used in BigQueryIO.
> >>>>>
> >>>>> What do you think?
> >>>>>
> >>>>> Thanks!
> >>>>> Claire
> >>>>>
> >>>>>
>

Re: [Proposal] Default tempDirectory setting in file-based IOs

Posted by Robert Bradshaw <ro...@google.com>.
Being able to (cheaply) rename files is precisely why the temporary
files are colocated with the final output destination. There are other
benefits as well like automatically inheriting the right permissions,
not having to worry about having sufficient disk/quota etc. in both
the temporary and final place, etc. I don't think it makes sense to
change this default. (I don't think the analogy with BigqueryIO holds,
as there we have no "local" place to collate with.

We could introduce a new withPipelineTempDirectory() method that would
invoke withTempDirectory with the PipelineOptions.getTempLocation to
make this easier.

On Fri, Sep 10, 2021 at 8:12 AM Claire McGinty
<cl...@gmail.com> wrote:
>
> Niel, that's a good point -- I don't think there's any restriction on the filesystem of PipelineOptions#tempLocation, I was able to run a job on DirectRunner with PipelineOptions#tempLocation set to a local path & my TextIO.Write#outputDirectory set to a remote filesystem.
>
> But currently there's also no check that would catch AvroIO.write(...).to(<filesystem-1-path>).withTempDirectory(<filesystem-2-path>) at graph construction time, either (it will fail at runtime instead). I think the FileSystems api has some logic for performing this check that we could extract into a utility method and use in PTransform#expand?
>
> - Claire
>
> On Fri, Sep 10, 2021 at 4:11 AM Niel Markwick <ni...@google.com> wrote:
>>
>> I have heard of an intermittent fault in avroIO where two independent pipelines using the same output directory deleted each others temp files while they were being written.
>>
>> I cannot reproduce the problem, nor have I found a code path that could cause it in fileio (with a superficial look), but it has been reported more than once...
>>
>>
>>
>> Back to the original question, is it possible that the tempDirectory from PipelineOptions points to a different filesystem than used for the fileio output? Because this could break transforms that write to a tempfile in the destination filesystem then rename the tempfile to the final output filename when writing is complete.
>>
>> On Fri, 10 Sep 2021, 07:31 Reuven Lax, <re...@google.com> wrote:
>>>
>>> While this makes sense, I can also see a risk of breaking existing users by changing the default like that. In addition it might be a less-performant default. A file rename that takes place within the same location might be performed via a fast rename operation, where otherwise it would be forced to generate an expensive copy + delete. This could cause strange and hard-to-debug performance problems when people upgrade to the newer Beam version.
>>>
>>> On Thu, Sep 9, 2021 at 8:15 PM Ahmet Altay <al...@google.com> wrote:
>>>>
>>>> Adding relevant folks +Chamikara Jayalath +Pablo Estrada
>>>>
>>>> This proposal makes sense to me. It makes it easier for users to reason about why a temp directory is chosen, and would lead to a unified code across all IOs that does this.
>>>>
>>>> On Thu, Sep 9, 2021 at 11:37 AM Claire McGinty <cl...@gmail.com> wrote:
>>>>>
>>>>> Hi Beam devs,
>>>>>
>>>>> I have a question/proposal about the default tempDirectory setting for file-based IOs. AvroIO, FileIO, TextIO all provide Builders with an optional tempDirectory setter, and when the transforms are expanded, tempDirectory will default to the value of the final output directory if null [AvroIO/FileIO/TextIO].
>>>>>
>>>>> I think it would make sense to default to the value of PipelineOptions#getTempLocation instead, which is accessible inside the expand(PCollection<T> input) method; it seems reasonable for the user to expect that their PipelineOptions#getTempLocation will be honored, and additionally, their final output locations may have locks/retention policies set that make the temp file renaming step fail. Plus, this pattern looks like it's already being used in BigQueryIO.
>>>>>
>>>>> What do you think?
>>>>>
>>>>> Thanks!
>>>>> Claire
>>>>>
>>>>>

Re: [Proposal] Default tempDirectory setting in file-based IOs

Posted by Claire McGinty <cl...@gmail.com>.
Niel, that's a good point -- I don't think there's any restriction on the
filesystem of PipelineOptions#tempLocation, I was able to run a job on
DirectRunner with PipelineOptions#tempLocation set to a local path & my
TextIO.Write#outputDirectory set to a remote filesystem.

But currently there's also no check that would catch
AvroIO.write(...).to(<filesystem-1-path>).withTempDirectory(<filesystem-2-path>)
at graph construction time, either (it will fail at runtime instead). I
think the FileSystems api
<https://github.com/apache/beam/blob/3a3a84483ba1d279d3d5ff53ecf6f0b925cece3d/sdks/java/core/src/main/java/org/apache/beam/sdk/io/FileSystems.java#L477>
has
some logic for performing this check that we could extract into a utility
method and use in PTransform#expand?

- Claire

On Fri, Sep 10, 2021 at 4:11 AM Niel Markwick <ni...@google.com> wrote:

> I have heard of an intermittent fault in avroIO where two independent
> pipelines using the same output directory deleted each others temp files
> while they were being written.
>
> I cannot reproduce the problem, nor have I found a code path that could
> cause it in fileio (with a superficial look), but it has been reported more
> than once...
>
>
>
> Back to the original question, is it possible that the tempDirectory from
> PipelineOptions points to a different filesystem than used for the fileio
> output? Because this could break transforms that write to a tempfile in the
> destination filesystem then rename the tempfile to the final output
> filename when writing is complete.
>
> On Fri, 10 Sep 2021, 07:31 Reuven Lax, <re...@google.com> wrote:
>
>> While this makes sense, I can also see a risk of breaking existing users
>> by changing the default like that. In addition it might be a
>> less-performant default. A file rename that takes place within the same
>> location might be performed via a fast rename operation, where otherwise it
>> would be forced to generate an expensive copy + delete. This could cause
>> strange and hard-to-debug performance problems when people upgrade to the
>> newer Beam version.
>>
>> On Thu, Sep 9, 2021 at 8:15 PM Ahmet Altay <al...@google.com> wrote:
>>
>>> Adding relevant folks +Chamikara Jayalath <ch...@google.com> +Pablo
>>> Estrada <pa...@google.com>
>>>
>>> This proposal makes sense to me. It makes it easier for users to reason
>>> about why a temp directory is chosen, and would lead to a unified code
>>> across all IOs that does this.
>>>
>>> On Thu, Sep 9, 2021 at 11:37 AM Claire McGinty <
>>> claire.d.mcginty@gmail.com> wrote:
>>>
>>>> Hi Beam devs,
>>>>
>>>> I have a question/proposal about the default tempDirectory setting for
>>>> file-based IOs. AvroIO, FileIO, TextIO all provide Builders with an
>>>> optional tempDirectory setter, and when the transforms are expanded,
>>>> tempDirectory will default to the value of the final output directory if
>>>> null [AvroIO
>>>> <https://github.com/apache/beam/blob/1d4a9ccd11c14ac6f0a2de1cc438a881244ede0a/sdks/java/core/src/main/java/org/apache/beam/sdk/io/AvroIO.java#L1634-L1637>
>>>> /FileIO
>>>> <https://github.com/apache/beam/blob/f759a5c7fe34c3d9e39cc21bb78cdc5da0a13eb1/sdks/java/core/src/main/java/org/apache/beam/sdk/io/FileIO.java#L1284-L1290>
>>>> /TextIO
>>>> <https://github.com/apache/beam/blob/243128a8fc52798e1b58b0cf1a271d95ee7aa241/sdks/java/core/src/main/java/org/apache/beam/sdk/io/TextIO.java#L992-L995>
>>>> ].
>>>>
>>>> I think it would make sense to default to the value of
>>>> PipelineOptions#getTempLocation instead, which is accessible inside
>>>> the expand(PCollection<T> input) method; it seems reasonable for the
>>>> user to expect that their PipelineOptions#getTempLocation will be
>>>> honored, and additionally, their final output locations may
>>>> have locks/retention policies set that make the temp file renaming step
>>>> fail. Plus, this pattern looks like it's already being used in
>>>> BigQueryIO
>>>> <https://github.com/apache/beam/blob/e76b4db30a90d8f351e807cb247a707e7a3c566c/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryIO.java#L944>
>>>> .
>>>>
>>>> What do you think?
>>>>
>>>> Thanks!
>>>> Claire
>>>>
>>>>
>>>>

Re: [Proposal] Default tempDirectory setting in file-based IOs

Posted by Niel Markwick <ni...@google.com>.
I have heard of an intermittent fault in avroIO where two independent
pipelines using the same output directory deleted each others temp files
while they were being written.

I cannot reproduce the problem, nor have I found a code path that could
cause it in fileio (with a superficial look), but it has been reported more
than once...



Back to the original question, is it possible that the tempDirectory from
PipelineOptions points to a different filesystem than used for the fileio
output? Because this could break transforms that write to a tempfile in the
destination filesystem then rename the tempfile to the final output
filename when writing is complete.

On Fri, 10 Sep 2021, 07:31 Reuven Lax, <re...@google.com> wrote:

> While this makes sense, I can also see a risk of breaking existing users
> by changing the default like that. In addition it might be a
> less-performant default. A file rename that takes place within the same
> location might be performed via a fast rename operation, where otherwise it
> would be forced to generate an expensive copy + delete. This could cause
> strange and hard-to-debug performance problems when people upgrade to the
> newer Beam version.
>
> On Thu, Sep 9, 2021 at 8:15 PM Ahmet Altay <al...@google.com> wrote:
>
>> Adding relevant folks +Chamikara Jayalath <ch...@google.com> +Pablo
>> Estrada <pa...@google.com>
>>
>> This proposal makes sense to me. It makes it easier for users to reason
>> about why a temp directory is chosen, and would lead to a unified code
>> across all IOs that does this.
>>
>> On Thu, Sep 9, 2021 at 11:37 AM Claire McGinty <
>> claire.d.mcginty@gmail.com> wrote:
>>
>>> Hi Beam devs,
>>>
>>> I have a question/proposal about the default tempDirectory setting for
>>> file-based IOs. AvroIO, FileIO, TextIO all provide Builders with an
>>> optional tempDirectory setter, and when the transforms are expanded,
>>> tempDirectory will default to the value of the final output directory if
>>> null [AvroIO
>>> <https://github.com/apache/beam/blob/1d4a9ccd11c14ac6f0a2de1cc438a881244ede0a/sdks/java/core/src/main/java/org/apache/beam/sdk/io/AvroIO.java#L1634-L1637>
>>> /FileIO
>>> <https://github.com/apache/beam/blob/f759a5c7fe34c3d9e39cc21bb78cdc5da0a13eb1/sdks/java/core/src/main/java/org/apache/beam/sdk/io/FileIO.java#L1284-L1290>
>>> /TextIO
>>> <https://github.com/apache/beam/blob/243128a8fc52798e1b58b0cf1a271d95ee7aa241/sdks/java/core/src/main/java/org/apache/beam/sdk/io/TextIO.java#L992-L995>
>>> ].
>>>
>>> I think it would make sense to default to the value of
>>> PipelineOptions#getTempLocation instead, which is accessible inside the expand(PCollection<T>
>>> input) method; it seems reasonable for the user to expect that their
>>> PipelineOptions#getTempLocation will be honored, and additionally,
>>> their final output locations may have locks/retention policies set that
>>> make the temp file renaming step fail. Plus, this pattern looks like it's
>>> already being used in BigQueryIO
>>> <https://github.com/apache/beam/blob/e76b4db30a90d8f351e807cb247a707e7a3c566c/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryIO.java#L944>
>>> .
>>>
>>> What do you think?
>>>
>>> Thanks!
>>> Claire
>>>
>>>
>>>

Re: [Proposal] Default tempDirectory setting in file-based IOs

Posted by Reuven Lax <re...@google.com>.
While this makes sense, I can also see a risk of breaking existing users by
changing the default like that. In addition it might be a less-performant
default. A file rename that takes place within the same location might be
performed via a fast rename operation, where otherwise it would be forced
to generate an expensive copy + delete. This could cause strange and
hard-to-debug performance problems when people upgrade to the newer Beam
version.

On Thu, Sep 9, 2021 at 8:15 PM Ahmet Altay <al...@google.com> wrote:

> Adding relevant folks +Chamikara Jayalath <ch...@google.com> +Pablo
> Estrada <pa...@google.com>
>
> This proposal makes sense to me. It makes it easier for users to reason
> about why a temp directory is chosen, and would lead to a unified code
> across all IOs that does this.
>
> On Thu, Sep 9, 2021 at 11:37 AM Claire McGinty <cl...@gmail.com>
> wrote:
>
>> Hi Beam devs,
>>
>> I have a question/proposal about the default tempDirectory setting for
>> file-based IOs. AvroIO, FileIO, TextIO all provide Builders with an
>> optional tempDirectory setter, and when the transforms are expanded,
>> tempDirectory will default to the value of the final output directory if
>> null [AvroIO
>> <https://github.com/apache/beam/blob/1d4a9ccd11c14ac6f0a2de1cc438a881244ede0a/sdks/java/core/src/main/java/org/apache/beam/sdk/io/AvroIO.java#L1634-L1637>
>> /FileIO
>> <https://github.com/apache/beam/blob/f759a5c7fe34c3d9e39cc21bb78cdc5da0a13eb1/sdks/java/core/src/main/java/org/apache/beam/sdk/io/FileIO.java#L1284-L1290>
>> /TextIO
>> <https://github.com/apache/beam/blob/243128a8fc52798e1b58b0cf1a271d95ee7aa241/sdks/java/core/src/main/java/org/apache/beam/sdk/io/TextIO.java#L992-L995>
>> ].
>>
>> I think it would make sense to default to the value of
>> PipelineOptions#getTempLocation instead, which is accessible inside the expand(PCollection<T>
>> input) method; it seems reasonable for the user to expect that their
>> PipelineOptions#getTempLocation will be honored, and additionally, their
>> final output locations may have locks/retention policies set that make the
>> temp file renaming step fail. Plus, this pattern looks like it's already
>> being used in BigQueryIO
>> <https://github.com/apache/beam/blob/e76b4db30a90d8f351e807cb247a707e7a3c566c/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryIO.java#L944>
>> .
>>
>> What do you think?
>>
>> Thanks!
>> Claire
>>
>>
>>

Re: [Proposal] Default tempDirectory setting in file-based IOs

Posted by Ahmet Altay <al...@google.com>.
Adding relevant folks +Chamikara Jayalath <ch...@google.com> +Pablo
Estrada <pa...@google.com>

This proposal makes sense to me. It makes it easier for users to reason
about why a temp directory is chosen, and would lead to a unified code
across all IOs that does this.

On Thu, Sep 9, 2021 at 11:37 AM Claire McGinty <cl...@gmail.com>
wrote:

> Hi Beam devs,
>
> I have a question/proposal about the default tempDirectory setting for
> file-based IOs. AvroIO, FileIO, TextIO all provide Builders with an
> optional tempDirectory setter, and when the transforms are expanded,
> tempDirectory will default to the value of the final output directory if
> null [AvroIO
> <https://github.com/apache/beam/blob/1d4a9ccd11c14ac6f0a2de1cc438a881244ede0a/sdks/java/core/src/main/java/org/apache/beam/sdk/io/AvroIO.java#L1634-L1637>
> /FileIO
> <https://github.com/apache/beam/blob/f759a5c7fe34c3d9e39cc21bb78cdc5da0a13eb1/sdks/java/core/src/main/java/org/apache/beam/sdk/io/FileIO.java#L1284-L1290>
> /TextIO
> <https://github.com/apache/beam/blob/243128a8fc52798e1b58b0cf1a271d95ee7aa241/sdks/java/core/src/main/java/org/apache/beam/sdk/io/TextIO.java#L992-L995>
> ].
>
> I think it would make sense to default to the value of
> PipelineOptions#getTempLocation instead, which is accessible inside the expand(PCollection<T>
> input) method; it seems reasonable for the user to expect that their
> PipelineOptions#getTempLocation will be honored, and additionally, their
> final output locations may have locks/retention policies set that make the
> temp file renaming step fail. Plus, this pattern looks like it's already
> being used in BigQueryIO
> <https://github.com/apache/beam/blob/e76b4db30a90d8f351e807cb247a707e7a3c566c/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryIO.java#L944>
> .
>
> What do you think?
>
> Thanks!
> Claire
>
>
>