You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@beam.apache.org by David Janíček <da...@gmail.com> on 2020/08/03 12:13:10 UTC

[BEAM-10292] change proposal to DefaultFilenamePolicy.ParamsCoder

Hello everyone,

I've reported an issue https://issues.apache.org/jira/browse/BEAM-10292
which is about broken DefaultFilenamePolicy.ParamsCoder behavior.
DefaultFilenamePolicy.ParamsCoder loses information whether
DefaultFilenamePolicy.Params's baseFilename resource is file or directory
on some filesystems, at least on local FS and HDFS.

After discussion with @dmvk and @lukecwik, we have agreed that the best
solution could be to take the breaking change and use ResourceIdCoder for
encoding/decoding DefaultFilenamePolicy.Params's baseFilename, this way the
file/directory information is preserved.
The solution is implemented in pull request
https://github.com/apache/beam/pull/12050.

I'd like to ask if there is a consensus on this breaking change. Is
everyone OK with this?
Thanks in advance for answers.

Best regards,
David

Re: [BEAM-10292] change proposal to DefaultFilenamePolicy.ParamsCoder

Posted by David Janíček <ja...@gmail.com>.
Thank you, Luke, for the review and the suggestions. I tried to implement
them.

st 19. 8. 2020 v 19:21 odesílatel Luke Cwik <lc...@google.com> napsal:

> I took a look at the PR and suggested some changes.
>
> On Tue, Aug 18, 2020 at 8:16 AM David Janíček <ja...@gmail.com>
> wrote:
>
>> I looked at the possibility to fix the underlying filesystem and it turns
>> out that only the local filesystem couldn't handle decoding right, HDFS and
>> some other filesystem, e.g. S3, already have a check for that.
>> So I added a similar check to the local filesystem too. The
>> implementation is in the same pull request
>> https://github.com/apache/beam/pull/12050.
>>
>> Can you take a look at it, please?
>>
>> Thanks,
>> David
>>
>> út 11. 8. 2020 v 19:39 odesílatel Luke Cwik <lc...@google.com> napsal:
>>
>>> The filesystem "fixes" all surmount to removing the "isDirectory"
>>> boolean bit and encoding whether something is a directory in the string
>>> part of the resource specification which also turns out to be backwards
>>> incompatible (just in a different way).
>>>
>>> Removing the "directory" bit would be great and that would allow us to
>>> use strings instead of resource ids but would require filesystems to
>>> perform the mapping from some standard path specification to their internal
>>> representation.
>>>
>>> On Wed, Aug 5, 2020 at 9:26 PM Chamikara Jayalath <ch...@google.com>
>>> wrote:
>>>
>>>> So, based on the comments in the PR, the underlying issue seems to be
>>>> 'FileBasedSink.convertToFileResourceIfPossible(stringCoder.decode(inStream));'
>>>> not returning the correct result, right ?
>>>> If so I think the correct fix might be your proposal (2) - Try to fix
>>>> the underlying filesystem to do a better job of file/dir matching
>>>>
>>>> This is a bug we probably have to fix anyways for the local filesystem
>>>> and/or HDFS and this will also give us a solution that does not break
>>>> update compatibility.
>>>>
>>>> Thanks,
>>>> Cham
>>>>
>>>> On Wed, Aug 5, 2020 at 3:41 PM Luke Cwik <lc...@google.com> wrote:
>>>>
>>>>> Cham, that was one of the options I had mentioned on the PR. The
>>>>> difference here is that this is a bug fix and existing users could be
>>>>> broken unknowingly so it might be worthwhile to take that breaking change
>>>>> (and possibly provide users a way to perform an upgrade using the old
>>>>> implementation).
>>>>>
>>>>>
>>>>> On Wed, Aug 5, 2020 at 3:33 PM Chamikara Jayalath <
>>>>> chamikara@google.com> wrote:
>>>>>
>>>>>> This might break the update compatibility for Dataflow streaming
>>>>>> pipelines. +Reuven Lax <re...@google.com>  +Lukasz Cwik
>>>>>> <lc...@google.com>
>>>>>>
>>>>>> In other cases, to save update compatibility, we introduced a user
>>>>>> option that changes the coder only when the user explicitly asks for an
>>>>>> updated feature that requires the new coder. For example,
>>>>>> https://github.com/apache/beam/commit/304882caa89afe24150062b959ee915c79e72ab3
>>>>>>
>>>>>> Thanks,
>>>>>> Cham
>>>>>>
>>>>>>
>>>>>> On Mon, Aug 3, 2020 at 10:00 AM David Janíček <da...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hello everyone,
>>>>>>>
>>>>>>> I've reported an issue
>>>>>>> https://issues.apache.org/jira/browse/BEAM-10292 which is about
>>>>>>> broken DefaultFilenamePolicy.ParamsCoder behavior.
>>>>>>> DefaultFilenamePolicy.ParamsCoder loses information whether
>>>>>>> DefaultFilenamePolicy.Params's baseFilename resource is file or directory
>>>>>>> on some filesystems, at least on local FS and HDFS.
>>>>>>>
>>>>>>> After discussion with @dmvk and @lukecwik, we have agreed that the
>>>>>>> best solution could be to take the breaking change and use ResourceIdCoder
>>>>>>> for encoding/decoding DefaultFilenamePolicy.Params's baseFilename, this way
>>>>>>> the file/directory information is preserved.
>>>>>>> The solution is implemented in pull request
>>>>>>> https://github.com/apache/beam/pull/12050.
>>>>>>>
>>>>>>> I'd like to ask if there is a consensus on this breaking change. Is
>>>>>>> everyone OK with this?
>>>>>>> Thanks in advance for answers.
>>>>>>>
>>>>>>> Best regards,
>>>>>>> David
>>>>>>>
>>>>>>
>>
>> --
>> S pozdravem
>> David Janíček
>>
>

-- 
S pozdravem
David Janíček

Re: [BEAM-10292] change proposal to DefaultFilenamePolicy.ParamsCoder

Posted by Luke Cwik <lc...@google.com>.
I took a look at the PR and suggested some changes.

On Tue, Aug 18, 2020 at 8:16 AM David Janíček <ja...@gmail.com>
wrote:

> I looked at the possibility to fix the underlying filesystem and it turns
> out that only the local filesystem couldn't handle decoding right, HDFS and
> some other filesystem, e.g. S3, already have a check for that.
> So I added a similar check to the local filesystem too. The implementation
> is in the same pull request https://github.com/apache/beam/pull/12050.
>
> Can you take a look at it, please?
>
> Thanks,
> David
>
> út 11. 8. 2020 v 19:39 odesílatel Luke Cwik <lc...@google.com> napsal:
>
>> The filesystem "fixes" all surmount to removing the "isDirectory" boolean
>> bit and encoding whether something is a directory in the string part of the
>> resource specification which also turns out to be backwards incompatible
>> (just in a different way).
>>
>> Removing the "directory" bit would be great and that would allow us to
>> use strings instead of resource ids but would require filesystems to
>> perform the mapping from some standard path specification to their internal
>> representation.
>>
>> On Wed, Aug 5, 2020 at 9:26 PM Chamikara Jayalath <ch...@google.com>
>> wrote:
>>
>>> So, based on the comments in the PR, the underlying issue seems to be
>>> 'FileBasedSink.convertToFileResourceIfPossible(stringCoder.decode(inStream));'
>>> not returning the correct result, right ?
>>> If so I think the correct fix might be your proposal (2) - Try to fix
>>> the underlying filesystem to do a better job of file/dir matching
>>>
>>> This is a bug we probably have to fix anyways for the local filesystem
>>> and/or HDFS and this will also give us a solution that does not break
>>> update compatibility.
>>>
>>> Thanks,
>>> Cham
>>>
>>> On Wed, Aug 5, 2020 at 3:41 PM Luke Cwik <lc...@google.com> wrote:
>>>
>>>> Cham, that was one of the options I had mentioned on the PR. The
>>>> difference here is that this is a bug fix and existing users could be
>>>> broken unknowingly so it might be worthwhile to take that breaking change
>>>> (and possibly provide users a way to perform an upgrade using the old
>>>> implementation).
>>>>
>>>>
>>>> On Wed, Aug 5, 2020 at 3:33 PM Chamikara Jayalath <ch...@google.com>
>>>> wrote:
>>>>
>>>>> This might break the update compatibility for Dataflow streaming
>>>>> pipelines. +Reuven Lax <re...@google.com>  +Lukasz Cwik
>>>>> <lc...@google.com>
>>>>>
>>>>> In other cases, to save update compatibility, we introduced a user
>>>>> option that changes the coder only when the user explicitly asks for an
>>>>> updated feature that requires the new coder. For example,
>>>>> https://github.com/apache/beam/commit/304882caa89afe24150062b959ee915c79e72ab3
>>>>>
>>>>> Thanks,
>>>>> Cham
>>>>>
>>>>>
>>>>> On Mon, Aug 3, 2020 at 10:00 AM David Janíček <da...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hello everyone,
>>>>>>
>>>>>> I've reported an issue
>>>>>> https://issues.apache.org/jira/browse/BEAM-10292 which is about
>>>>>> broken DefaultFilenamePolicy.ParamsCoder behavior.
>>>>>> DefaultFilenamePolicy.ParamsCoder loses information whether
>>>>>> DefaultFilenamePolicy.Params's baseFilename resource is file or directory
>>>>>> on some filesystems, at least on local FS and HDFS.
>>>>>>
>>>>>> After discussion with @dmvk and @lukecwik, we have agreed that the
>>>>>> best solution could be to take the breaking change and use ResourceIdCoder
>>>>>> for encoding/decoding DefaultFilenamePolicy.Params's baseFilename, this way
>>>>>> the file/directory information is preserved.
>>>>>> The solution is implemented in pull request
>>>>>> https://github.com/apache/beam/pull/12050.
>>>>>>
>>>>>> I'd like to ask if there is a consensus on this breaking change. Is
>>>>>> everyone OK with this?
>>>>>> Thanks in advance for answers.
>>>>>>
>>>>>> Best regards,
>>>>>> David
>>>>>>
>>>>>
>
> --
> S pozdravem
> David Janíček
>

Re: [BEAM-10292] change proposal to DefaultFilenamePolicy.ParamsCoder

Posted by David Janíček <ja...@gmail.com>.
I looked at the possibility to fix the underlying filesystem and it turns
out that only the local filesystem couldn't handle decoding right, HDFS and
some other filesystem, e.g. S3, already have a check for that.
So I added a similar check to the local filesystem too. The implementation
is in the same pull request https://github.com/apache/beam/pull/12050.

Can you take a look at it, please?

Thanks,
David

út 11. 8. 2020 v 19:39 odesílatel Luke Cwik <lc...@google.com> napsal:

> The filesystem "fixes" all surmount to removing the "isDirectory" boolean
> bit and encoding whether something is a directory in the string part of the
> resource specification which also turns out to be backwards incompatible
> (just in a different way).
>
> Removing the "directory" bit would be great and that would allow us to use
> strings instead of resource ids but would require filesystems to perform
> the mapping from some standard path specification to their internal
> representation.
>
> On Wed, Aug 5, 2020 at 9:26 PM Chamikara Jayalath <ch...@google.com>
> wrote:
>
>> So, based on the comments in the PR, the underlying issue seems to be
>> 'FileBasedSink.convertToFileResourceIfPossible(stringCoder.decode(inStream));'
>> not returning the correct result, right ?
>> If so I think the correct fix might be your proposal (2) - Try to fix the
>> underlying filesystem to do a better job of file/dir matching
>>
>> This is a bug we probably have to fix anyways for the local filesystem
>> and/or HDFS and this will also give us a solution that does not break
>> update compatibility.
>>
>> Thanks,
>> Cham
>>
>> On Wed, Aug 5, 2020 at 3:41 PM Luke Cwik <lc...@google.com> wrote:
>>
>>> Cham, that was one of the options I had mentioned on the PR. The
>>> difference here is that this is a bug fix and existing users could be
>>> broken unknowingly so it might be worthwhile to take that breaking change
>>> (and possibly provide users a way to perform an upgrade using the old
>>> implementation).
>>>
>>>
>>> On Wed, Aug 5, 2020 at 3:33 PM Chamikara Jayalath <ch...@google.com>
>>> wrote:
>>>
>>>> This might break the update compatibility for Dataflow streaming
>>>> pipelines. +Reuven Lax <re...@google.com>  +Lukasz Cwik
>>>> <lc...@google.com>
>>>>
>>>> In other cases, to save update compatibility, we introduced a user
>>>> option that changes the coder only when the user explicitly asks for an
>>>> updated feature that requires the new coder. For example,
>>>> https://github.com/apache/beam/commit/304882caa89afe24150062b959ee915c79e72ab3
>>>>
>>>> Thanks,
>>>> Cham
>>>>
>>>>
>>>> On Mon, Aug 3, 2020 at 10:00 AM David Janíček <da...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hello everyone,
>>>>>
>>>>> I've reported an issue
>>>>> https://issues.apache.org/jira/browse/BEAM-10292 which is about
>>>>> broken DefaultFilenamePolicy.ParamsCoder behavior.
>>>>> DefaultFilenamePolicy.ParamsCoder loses information whether
>>>>> DefaultFilenamePolicy.Params's baseFilename resource is file or directory
>>>>> on some filesystems, at least on local FS and HDFS.
>>>>>
>>>>> After discussion with @dmvk and @lukecwik, we have agreed that the
>>>>> best solution could be to take the breaking change and use ResourceIdCoder
>>>>> for encoding/decoding DefaultFilenamePolicy.Params's baseFilename, this way
>>>>> the file/directory information is preserved.
>>>>> The solution is implemented in pull request
>>>>> https://github.com/apache/beam/pull/12050.
>>>>>
>>>>> I'd like to ask if there is a consensus on this breaking change. Is
>>>>> everyone OK with this?
>>>>> Thanks in advance for answers.
>>>>>
>>>>> Best regards,
>>>>> David
>>>>>
>>>>

-- 
S pozdravem
David Janíček

Re: [BEAM-10292] change proposal to DefaultFilenamePolicy.ParamsCoder

Posted by Luke Cwik <lc...@google.com>.
The filesystem "fixes" all surmount to removing the "isDirectory" boolean
bit and encoding whether something is a directory in the string part of the
resource specification which also turns out to be backwards incompatible
(just in a different way).

Removing the "directory" bit would be great and that would allow us to use
strings instead of resource ids but would require filesystems to perform
the mapping from some standard path specification to their internal
representation.

On Wed, Aug 5, 2020 at 9:26 PM Chamikara Jayalath <ch...@google.com>
wrote:

> So, based on the comments in the PR, the underlying issue seems to be
> 'FileBasedSink.convertToFileResourceIfPossible(stringCoder.decode(inStream));'
> not returning the correct result, right ?
> If so I think the correct fix might be your proposal (2) - Try to fix the
> underlying filesystem to do a better job of file/dir matching
>
> This is a bug we probably have to fix anyways for the local filesystem
> and/or HDFS and this will also give us a solution that does not break
> update compatibility.
>
> Thanks,
> Cham
>
> On Wed, Aug 5, 2020 at 3:41 PM Luke Cwik <lc...@google.com> wrote:
>
>> Cham, that was one of the options I had mentioned on the PR. The
>> difference here is that this is a bug fix and existing users could be
>> broken unknowingly so it might be worthwhile to take that breaking change
>> (and possibly provide users a way to perform an upgrade using the old
>> implementation).
>>
>>
>> On Wed, Aug 5, 2020 at 3:33 PM Chamikara Jayalath <ch...@google.com>
>> wrote:
>>
>>> This might break the update compatibility for Dataflow streaming
>>> pipelines. +Reuven Lax <re...@google.com>  +Lukasz Cwik
>>> <lc...@google.com>
>>>
>>> In other cases, to save update compatibility, we introduced a user
>>> option that changes the coder only when the user explicitly asks for an
>>> updated feature that requires the new coder. For example,
>>> https://github.com/apache/beam/commit/304882caa89afe24150062b959ee915c79e72ab3
>>>
>>> Thanks,
>>> Cham
>>>
>>>
>>> On Mon, Aug 3, 2020 at 10:00 AM David Janíček <da...@gmail.com>
>>> wrote:
>>>
>>>> Hello everyone,
>>>>
>>>> I've reported an issue https://issues.apache.org/jira/browse/BEAM-10292
>>>> which is about broken DefaultFilenamePolicy.ParamsCoder behavior.
>>>> DefaultFilenamePolicy.ParamsCoder loses information whether
>>>> DefaultFilenamePolicy.Params's baseFilename resource is file or directory
>>>> on some filesystems, at least on local FS and HDFS.
>>>>
>>>> After discussion with @dmvk and @lukecwik, we have agreed that the best
>>>> solution could be to take the breaking change and use ResourceIdCoder for
>>>> encoding/decoding DefaultFilenamePolicy.Params's baseFilename, this way the
>>>> file/directory information is preserved.
>>>> The solution is implemented in pull request
>>>> https://github.com/apache/beam/pull/12050.
>>>>
>>>> I'd like to ask if there is a consensus on this breaking change. Is
>>>> everyone OK with this?
>>>> Thanks in advance for answers.
>>>>
>>>> Best regards,
>>>> David
>>>>
>>>

Re: [BEAM-10292] change proposal to DefaultFilenamePolicy.ParamsCoder

Posted by Chamikara Jayalath <ch...@google.com>.
So, based on the comments in the PR, the underlying issue seems to be
'FileBasedSink.convertToFileResourceIfPossible(stringCoder.decode(inStream));'
not returning the correct result, right ?
If so I think the correct fix might be your proposal (2) - Try to fix the
underlying filesystem to do a better job of file/dir matching

This is a bug we probably have to fix anyways for the local filesystem
and/or HDFS and this will also give us a solution that does not break
update compatibility.

Thanks,
Cham

On Wed, Aug 5, 2020 at 3:41 PM Luke Cwik <lc...@google.com> wrote:

> Cham, that was one of the options I had mentioned on the PR. The
> difference here is that this is a bug fix and existing users could be
> broken unknowingly so it might be worthwhile to take that breaking change
> (and possibly provide users a way to perform an upgrade using the old
> implementation).
>
>
> On Wed, Aug 5, 2020 at 3:33 PM Chamikara Jayalath <ch...@google.com>
> wrote:
>
>> This might break the update compatibility for Dataflow streaming
>> pipelines. +Reuven Lax <re...@google.com>  +Lukasz Cwik
>> <lc...@google.com>
>>
>> In other cases, to save update compatibility, we introduced a user option
>> that changes the coder only when the user explicitly asks for an updated
>> feature that requires the new coder. For example,
>> https://github.com/apache/beam/commit/304882caa89afe24150062b959ee915c79e72ab3
>>
>> Thanks,
>> Cham
>>
>>
>> On Mon, Aug 3, 2020 at 10:00 AM David Janíček <da...@gmail.com>
>> wrote:
>>
>>> Hello everyone,
>>>
>>> I've reported an issue https://issues.apache.org/jira/browse/BEAM-10292
>>> which is about broken DefaultFilenamePolicy.ParamsCoder behavior.
>>> DefaultFilenamePolicy.ParamsCoder loses information whether
>>> DefaultFilenamePolicy.Params's baseFilename resource is file or directory
>>> on some filesystems, at least on local FS and HDFS.
>>>
>>> After discussion with @dmvk and @lukecwik, we have agreed that the best
>>> solution could be to take the breaking change and use ResourceIdCoder for
>>> encoding/decoding DefaultFilenamePolicy.Params's baseFilename, this way the
>>> file/directory information is preserved.
>>> The solution is implemented in pull request
>>> https://github.com/apache/beam/pull/12050.
>>>
>>> I'd like to ask if there is a consensus on this breaking change. Is
>>> everyone OK with this?
>>> Thanks in advance for answers.
>>>
>>> Best regards,
>>> David
>>>
>>

Re: [BEAM-10292] change proposal to DefaultFilenamePolicy.ParamsCoder

Posted by Luke Cwik <lc...@google.com>.
Cham, that was one of the options I had mentioned on the PR. The difference
here is that this is a bug fix and existing users could be broken
unknowingly so it might be worthwhile to take that breaking change (and
possibly provide users a way to perform an upgrade using the old
implementation).


On Wed, Aug 5, 2020 at 3:33 PM Chamikara Jayalath <ch...@google.com>
wrote:

> This might break the update compatibility for Dataflow streaming
> pipelines. +Reuven Lax <re...@google.com>  +Lukasz Cwik <lc...@google.com>
>
>
> In other cases, to save update compatibility, we introduced a user option
> that changes the coder only when the user explicitly asks for an updated
> feature that requires the new coder. For example,
> https://github.com/apache/beam/commit/304882caa89afe24150062b959ee915c79e72ab3
>
> Thanks,
> Cham
>
>
> On Mon, Aug 3, 2020 at 10:00 AM David Janíček <da...@gmail.com> wrote:
>
>> Hello everyone,
>>
>> I've reported an issue https://issues.apache.org/jira/browse/BEAM-10292
>> which is about broken DefaultFilenamePolicy.ParamsCoder behavior.
>> DefaultFilenamePolicy.ParamsCoder loses information whether
>> DefaultFilenamePolicy.Params's baseFilename resource is file or directory
>> on some filesystems, at least on local FS and HDFS.
>>
>> After discussion with @dmvk and @lukecwik, we have agreed that the best
>> solution could be to take the breaking change and use ResourceIdCoder for
>> encoding/decoding DefaultFilenamePolicy.Params's baseFilename, this way the
>> file/directory information is preserved.
>> The solution is implemented in pull request
>> https://github.com/apache/beam/pull/12050.
>>
>> I'd like to ask if there is a consensus on this breaking change. Is
>> everyone OK with this?
>> Thanks in advance for answers.
>>
>> Best regards,
>> David
>>
>

Re: [BEAM-10292] change proposal to DefaultFilenamePolicy.ParamsCoder

Posted by Chamikara Jayalath <ch...@google.com>.
This might break the update compatibility for Dataflow streaming
pipelines. +Reuven
Lax <re...@google.com>  +Lukasz Cwik <lc...@google.com>

In other cases, to save update compatibility, we introduced a user option
that changes the coder only when the user explicitly asks for an updated
feature that requires the new coder. For example,
https://github.com/apache/beam/commit/304882caa89afe24150062b959ee915c79e72ab3

Thanks,
Cham


On Mon, Aug 3, 2020 at 10:00 AM David Janíček <da...@gmail.com> wrote:

> Hello everyone,
>
> I've reported an issue https://issues.apache.org/jira/browse/BEAM-10292
> which is about broken DefaultFilenamePolicy.ParamsCoder behavior.
> DefaultFilenamePolicy.ParamsCoder loses information whether
> DefaultFilenamePolicy.Params's baseFilename resource is file or directory
> on some filesystems, at least on local FS and HDFS.
>
> After discussion with @dmvk and @lukecwik, we have agreed that the best
> solution could be to take the breaking change and use ResourceIdCoder for
> encoding/decoding DefaultFilenamePolicy.Params's baseFilename, this way the
> file/directory information is preserved.
> The solution is implemented in pull request
> https://github.com/apache/beam/pull/12050.
>
> I'd like to ask if there is a consensus on this breaking change. Is
> everyone OK with this?
> Thanks in advance for answers.
>
> Best regards,
> David
>