Posted to user@beam.apache.org by Jozef Vilcek <jo...@gmail.com> on 2018/07/26 09:39:57 UTC

FileBasedSink.WriteOperation copy instead of move?

Hello,

I just came across the FileBasedSink.WriteOperation class, which has a
moveToOutput() method. Its implementation does a FileSystems.copy() instead of
a "move". With large files I find this quite inefficient if the underlying FS
supports more efficient ways, so I wonder what the story behind it is. Must
it be a copy?

https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/FileBasedSink.java#L761

Re: FileBasedSink.WriteOperation copy instead of move?

Posted by Chamikara Jayalath <ch...@google.com>.
Yeah, please file a JIRA.

- Cham

On Thu, Jul 26, 2018 at 11:33 AM Jozef Vilcek <jo...@gmail.com> wrote:

> Yes, rename can be tricky with cross-directory. This is related
> https://issues.apache.org/jira/browse/BEAM-4861
> I guess I can file a JIRA for this, right?
>
> On Thu, Jul 26, 2018 at 7:31 PM Chamikara Jayalath <ch...@google.com>
> wrote:
>
>> Also, we'll have to use StandardMoveOptions.IGNORE_MISSING_FILES for
>> supporting failures of the rename step. I think this is a good change to do
>> if the change significantly improves the performance of some of the
>> FileSystems (note that some FileSystems, for example GCS, implement rename
>> in the form of a copy+delete, so there will be no significant performance
>> improvements for such FileSystems).
>>
>> -Cham
>>
>> On Thu, Jul 26, 2018 at 10:14 AM Reuven Lax <re...@google.com> wrote:
>>
>>> We might be able to replace this with Filesystem.rename(). One thing to
>>> keep in mind - the destination files might be in a different directory, so
>>> we would need to make sure that all Filesystems support cross-directory
>>> rename.
>>>
>>> On Thu, Jul 26, 2018 at 9:58 AM Lukasz Cwik <lc...@google.com> wrote:
>>>
>>>> +dev
>>>>
>>>> On Thu, Jul 26, 2018 at 2:40 AM Jozef Vilcek <jo...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> just came across FileBasedSink.WriteOperation class which does have
>>>>> moveToOutput() method. Implementation does a Filesystem.copy() instead of
>>>>> "move". With large files I find it quite inefficient if the underlying FS
>>>>> supports more efficient ways, so I wonder what is the story behind it? Must
>>>>> it be a copy?
>>>>>
>>>>>
>>>>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/FileBasedSink.java#L761
>>>>>
>>>>

Re: FileBasedSink.WriteOperation copy instead of move?

Posted by Jozef Vilcek <jo...@gmail.com>.
Yes, cross-directory rename can be tricky. This is related:
https://issues.apache.org/jira/browse/BEAM-4861
I guess I can file a JIRA for this, right?

On Thu, Jul 26, 2018 at 7:31 PM Chamikara Jayalath <ch...@google.com>
wrote:

> Also, we'll have to use StandardMoveOptions.IGNORE_MISSING_FILES for
> supporting failures of the rename step. I think this is a good change to do
> if the change significantly improves the performance of some of the
> FileSystems (note that some FileSystems, for example GCS, implement rename
> in the form of a copy+delete, so there will be no significant performance
> improvements for such FileSystems).
>
> -Cham
>
> On Thu, Jul 26, 2018 at 10:14 AM Reuven Lax <re...@google.com> wrote:
>
>> We might be able to replace this with Filesystem.rename(). One thing to
>> keep in mind - the destination files might be in a different directory, so
>> we would need to make sure that all Filesystems support cross-directory
>> rename.
>>
>> On Thu, Jul 26, 2018 at 9:58 AM Lukasz Cwik <lc...@google.com> wrote:
>>
>>> +dev
>>>
>>> On Thu, Jul 26, 2018 at 2:40 AM Jozef Vilcek <jo...@gmail.com>
>>> wrote:
>>>
>>>> Hello,
>>>>
>>>> just came across FileBasedSink.WriteOperation class which does have
>>>> moveToOutput() method. Implementation does a Filesystem.copy() instead of
>>>> "move". With large files I find it quite inefficient if the underlying FS
>>>> supports more efficient ways, so I wonder what is the story behind it? Must
>>>> it be a copy?
>>>>
>>>>
>>>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/FileBasedSink.java#L761
>>>>
>>>

Re: FileBasedSink.WriteOperation copy instead of move?

Posted by Chamikara Jayalath <ch...@google.com>.
Also, we'll have to use StandardMoveOptions.IGNORE_MISSING_FILES so that the
rename step can be retried after a failure. I think this is a good change to
make if it significantly improves the performance of some of the
FileSystems (note that some FileSystems, for example GCS, implement rename
as a copy+delete, so there will be no significant performance
improvement for such FileSystems).

-Cham
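
A minimal sketch of the retry concern, assuming plain java.nio.file rather
than Beam's own FileSystems API. The class RetryableRename and its method are
hypothetical names used only for illustration. The idea is that a rename step
that failed halfway has already moved some sources, so a second attempt must
skip missing sources instead of failing on them.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.List;

/** Hypothetical helper (not a Beam class): a rename step that is safe to retry. */
final class RetryableRename {
  private RetryableRename() {}

  /**
   * Moves sources.get(i) to destinations.get(i). A source that no longer
   * exists is assumed to have been moved by an earlier attempt and is skipped.
   * Destination directories must already exist.
   *
   * @return the number of files moved by this call
   */
  static int renameIgnoringMissing(List<Path> sources, List<Path> destinations)
      throws IOException {
    if (sources.size() != destinations.size()) {
      throw new IllegalArgumentException("sources and destinations differ in size");
    }
    int moved = 0;
    for (int i = 0; i < sources.size(); i++) {
      Path src = sources.get(i);
      if (Files.notExists(src)) {
        continue; // Already handled by a previous attempt; do not fail.
      }
      // REPLACE_EXISTING lets a retry overwrite a partially written destination.
      Files.move(src, destinations.get(i), StandardCopyOption.REPLACE_EXISTING);
      moved++;
    }
    return moved;
  }
}
```

Calling renameIgnoringMissing twice with the same arguments moves the files
on the first call and is a no-op on the second.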

On Thu, Jul 26, 2018 at 10:14 AM Reuven Lax <re...@google.com> wrote:

> We might be able to replace this with Filesystem.rename(). One thing to
> keep in mind - the destination files might be in a different directory, so
> we would need to make sure that all Filesystems support cross-directory
> rename.
>
> On Thu, Jul 26, 2018 at 9:58 AM Lukasz Cwik <lc...@google.com> wrote:
>
>> +dev
>>
>> On Thu, Jul 26, 2018 at 2:40 AM Jozef Vilcek <jo...@gmail.com>
>> wrote:
>>
>>> Hello,
>>>
>>> just came across FileBasedSink.WriteOperation class which does have
>>> moveToOutput() method. Implementation does a Filesystem.copy() instead of
>>> "move". With large files I find it quite inefficient if the underlying FS
>>> supports more efficient ways, so I wonder what is the story behind it? Must
>>> it be a copy?
>>>
>>>
>>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/FileBasedSink.java#L761
>>>
>>

Re: FileBasedSink.WriteOperation copy instead of move?

Posted by Reuven Lax <re...@google.com>.
We might be able to replace this with FileSystems.rename(). One thing to
keep in mind: the destination files might be in a different directory, so
we would need to make sure that all FileSystem implementations support
cross-directory rename.
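
The cross-directory concern can be illustrated with java.nio.file, outside of
Beam. The sketch below is only an analogy, and MoveOrCopy and its moveToOutput
method are hypothetical names. It tries an atomic rename first and falls back
to copy+delete when the file system reports that an atomic move between the
two locations is not supported.

```java
import java.io.IOException;
import java.nio.file.AtomicMoveNotSupportedException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Hypothetical helper (not a Beam class): rename when possible, else copy+delete. */
final class MoveOrCopy {
  private MoveOrCopy() {}

  /**
   * Moves src to dst, creating the destination directory if needed.
   *
   * @return true if an atomic rename was used, false if it fell back to copy+delete
   */
  static boolean moveToOutput(Path src, Path dst) throws IOException {
    Path parent = dst.toAbsolutePath().getParent();
    if (parent != null) {
      Files.createDirectories(parent);
    }
    try {
      // Cheap on file systems with native rename. Behavior when dst already
      // exists is implementation specific for ATOMIC_MOVE.
      Files.move(src, dst, StandardCopyOption.ATOMIC_MOVE);
      return true;
    } catch (AtomicMoveNotSupportedException e) {
      // For example, src and dst live on different file stores.
      Files.copy(src, dst, StandardCopyOption.REPLACE_EXISTING);
      Files.delete(src);
      return false;
    }
  }
}
```

Either path leaves the data at dst and removes src; only the cost differs,
which is why the gain depends on the underlying file system.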

On Thu, Jul 26, 2018 at 9:58 AM Lukasz Cwik <lc...@google.com> wrote:

> +dev
>
> On Thu, Jul 26, 2018 at 2:40 AM Jozef Vilcek <jo...@gmail.com>
> wrote:
>
>> Hello,
>>
>> just came across FileBasedSink.WriteOperation class which does have
>> moveToOutput() method. Implementation does a Filesystem.copy() instead of
>> "move". With large files I find it quite inefficient if the underlying FS
>> supports more efficient ways, so I wonder what is the story behind it? Must
>> it be a copy?
>>
>>
>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/FileBasedSink.java#L761
>>
>

Re: FileBasedSink.WriteOperation copy instead of move?

Posted by Lukasz Cwik <lc...@google.com>.
+dev

On Thu, Jul 26, 2018 at 2:40 AM Jozef Vilcek <jo...@gmail.com> wrote:

> Hello,
>
> just came across FileBasedSink.WriteOperation class which does have
> moveToOutput() method. Implementation does a Filesystem.copy() instead of
> "move". With large files I find it quite inefficient if the underlying FS
> supports more efficient ways, so I wonder what is the story behind it? Must
> it be a copy?
>
>
> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/FileBasedSink.java#L761
>
