Posted to dev@beam.apache.org by Jeff Klukas <jk...@mozilla.com> on 2018/11/02 12:43:59 UTC

Evolving a Coder for an added field

I'm adding a new lastModifiedMillis field to MatchResult.Metadata [0], which
also requires updating MetadataCoder, but it's not clear to me whether there
are guidelines to follow when evolving a type in a way that changes its
encoding.

Is a user allowed to update Beam library versions as part of updating a
pipeline? If so, there could be a situation where an updated pipeline is
reading state that includes Metadata encoded without the new
lastModifiedMillis field, which would cause a CoderException to be thrown.

Is there prior art for evolving a type and its Coder? Should I be defensive
and catch CoderException when attempting to decode the new field,
providing a default value?

[0] https://github.com/apache/beam/pull/6914
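
For concreteness, the defensive option described above would look roughly like
the sketch below. This is an illustration only, not the code in the PR: the
component coders, the LenientMetadataCoder name, and the lastModifiedMillis
accessor/setter are assumptions about what the encoding and builder might look
like.

    // Illustration only -- a coder that appends the new field and falls back to
    // a default when decoding bytes that were written without it.
    import java.io.IOException;
    import java.io.InputStream;
    import java.io.OutputStream;
    import org.apache.beam.sdk.coders.AtomicCoder;
    import org.apache.beam.sdk.coders.VarIntCoder;
    import org.apache.beam.sdk.coders.VarLongCoder;
    import org.apache.beam.sdk.io.fs.MatchResult.Metadata;
    import org.apache.beam.sdk.io.fs.ResourceId;
    import org.apache.beam.sdk.io.fs.ResourceIdCoder;

    public class LenientMetadataCoder extends AtomicCoder<Metadata> {

      @Override
      public void encode(Metadata value, OutputStream out) throws IOException {
        ResourceIdCoder.of().encode(value.resourceId(), out);
        VarIntCoder.of().encode(value.isReadSeekEfficient() ? 1 : 0, out);
        VarLongCoder.of().encode(value.sizeBytes(), out);
        VarLongCoder.of().encode(value.lastModifiedMillis(), out); // new field, written last
      }

      @Override
      public Metadata decode(InputStream in) throws IOException {
        ResourceId resourceId = ResourceIdCoder.of().decode(in);
        boolean isReadSeekEfficient = VarIntCoder.of().decode(in) == 1;
        long sizeBytes = VarLongCoder.of().decode(in);
        long lastModifiedMillis = -1L; // default when the field is absent
        try {
          lastModifiedMillis = VarLongCoder.of().decode(in); // present only in new encodings
        } catch (IOException e) {
          // Old encoding: the stream ends before the new field; keep the default.
        }
        return Metadata.builder()
            .setResourceId(resourceId)
            .setIsReadSeekEfficient(isReadSeekEfficient)
            .setSizeBytes(sizeBytes)
            .setLastModifiedMillis(lastModifiedMillis) // proposed new setter
            .build();
      }
    }

Note that catch-and-default only behaves as intended when the old-format bytes
are followed by end-of-stream; in a nested encoding the extra read could
consume bytes belonging to the next element, which is one reason the replies
below lean toward explicitly versioned coders instead.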

Re: Evolving a Coder for an added field

Posted by Jeff Klukas <jk...@mozilla.com>.
Picking this back up, I've modified the PR [0] to add a MetadataCoderV2
that encodes/decodes the new lastModifiedMillis field, and MetadataCoder now
provides a default value of -1 for lastModifiedMillis when decoding, to
maintain compatibility. The intent is to follow the same pattern as the
existing TableDestinationCoderV2. MetadataCoder remains the default
registered coder for Metadata, so MetadataCoderV2 is strictly opt-in at
this point.

I'm hoping to get review on the above so the change can be available in the
short term, while work continues to understand the way forward for coder
versioning. I added a link to this thread to BEAM-3616 [1] which already
captures the need for coder versioning.

[0] https://github.com/apache/beam/pull/6914
[1] https://issues.apache.org/jira/browse/BEAM-3616
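
To make the split concrete, the decode side looks roughly like the following.
This is a sketch of the approach just described rather than the exact code in
the PR; decodeBuilder is a hypothetical shared helper that reads the
pre-existing fields (resourceId, isReadSeekEfficient, sizeBytes) into a
Metadata.Builder.

    // In MetadataCoder (still the default): old wire format, default on decode.
    @Override
    public Metadata decode(InputStream in) throws IOException {
      return decodeBuilder(in)
          .setLastModifiedMillis(-1L) // field is never on the wire in V1
          .build();
    }

    // In MetadataCoderV2 (opt-in): the new field is part of the encoding.
    @Override
    public Metadata decode(InputStream in) throws IOException {
      return decodeBuilder(in)
          .setLastModifiedMillis(VarLongCoder.of().decode(in))
          .build();
    }

Because MetadataCoder never writes or reads the new field, bytes produced by
older SDK versions keep decoding exactly as before, and only pipelines that
explicitly adopt MetadataCoderV2 see the new encoding.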

On Mon, Nov 26, 2018 at 11:47 AM Lukasz Cwik <lc...@google.com> wrote:

> Reuven was one of the people I reached out to on this matter and he
> replied on this thread.
>
> On Mon, Nov 26, 2018 at 7:07 AM Robert Bradshaw <ro...@google.com>
> wrote:
>
>> Modifying an existing coder is a non-starter until we have a versioning
>> story. Creating an entirely new coder should definitely be possible, and
>> using it either opt-in or, if a good enough case can be made, possibly even
>> opt-out could get this unblocked.
>>
>> On Mon, Nov 26, 2018 at 3:05 PM Jeff Klukas <jk...@mozilla.com> wrote:
>>
>>> Lukasz - Were you able to get any more context on the possibility of
>>> versioning coders from other folks at Google?
>>>
>>> It sounds like adding versioning for coders and/or schemas is
>>> potentially a large change. At this point, should I just write up some
>>> highlights from this thread in a JIRA issue for future tracking?
>>>
>>> On Mon, Nov 12, 2018 at 8:23 PM Reuven Lax <re...@google.com> wrote:
>>>
>>>> A few thoughts:
>>>>
>>>> 1. I agree with you about coder versioning. The lack of a good story
>>>> around versioning has been a huge pain here, and it's unfortunate that
>>>> nobody ever worked on this.
>>>>
>>>> 2. I think versioning schemas will be easier than versioning coders
>>>> (especially for adding new fields). In many cases I suggest we start
>>>> looking at migrating as much as possible to schemas, and in Beam 3.0 maybe
>>>> we can migrate all of our internal payloads to schemas. Schemas support
>>>> nested fields, repeated fields, and map fields - which can model most things.
>>>>
>>>> 3. There was a Beam proposal for a way to generically handle
>>>> incompatible schema updates via snapshots. The idea was that such updates
>>>> can be accompanied by a transform that maps a pipeline snapshot into a new
>>>> snapshot with the encodings modified.
>>>>
>>>> Reuven
>>>>
>>>> On Tue, Nov 13, 2018 at 3:16 AM Jeff Klukas <jk...@mozilla.com>
>>>> wrote:
>>>>
>>>>> Conversation here has fizzled, but it sounds like there's basically a
>>>>> consensus on the need for a new concept of Coder versioning that's
>>>>> accessible at the Java level in order to allow an evolution path. Further,
>>>>> it sounds like my open PR [0] for adding a new field to Metadata is
>>>>> essentially blocked until we have coder versioning in place.
>>>>>
>>>>> Is there any existing documentation of these concepts, or should I go
>>>>> ahead and file a new Jira issue summarizing the problem? I don't think I
>>>>> have a comprehensive enough understanding of the Coder machinery to be able
>>>>> to design a solution, so I'd need to hand this off or simply leave it in
>>>>> the Jira backlog.
>>>>>
>>>>> [0] https://github.com/apache/beam/pull/6914
>>>>>
>>>>>
>>>>> On Tue, Nov 6, 2018 at 4:38 AM Robert Bradshaw <ro...@google.com>
>>>>> wrote:
>>>>>
>>>>>> Yes, a Coder author should be able to register a URN with a mapping
>>>>>> from (components + payload) -> Coder (and vice versa), and this should
>>>>>> be more lightweight than manually editing the proto files.
>>>>>> On Mon, Nov 5, 2018 at 7:12 PM Thomas Weise <th...@apache.org> wrote:
>>>>>> >
>>>>>> > +1
>>>>>> >
>>>>>> > I think that coders should be immutable/versioned. The SDK should
>>>>>> know about all the available versions and be able to associate the data
>>>>>> (stream or at rest) with the corresponding coder version via URN. We can
>>>>>> also look at how that is solved elsewhere, for example the Kafka schema
>>>>>> registry.
>>>>>> >
>>>>>> > Today we only have a few URNs for standard coders:
>>>>>> https://github.com/apache/beam/blob/master/model/pipeline/src/main/proto/beam_runner_api.proto#L617
>>>>>> >
>>>>>> > I imagine we will need a coder registry where IOs and users can add
>>>>>> their versioned coders also?
>>>>>> >
>>>>>> > Thanks,
>>>>>> > Thomas
>>>>>> >
>>>>>> >
>>>>>> > On Mon, Nov 5, 2018 at 7:54 AM Jean-Baptiste Onofré <
>>>>>> jb@nanthrax.net> wrote:
>>>>>> >>
>>>>>> >> It makes sense to have a more concrete URN including the version.
>>>>>> >>
>>>>>> >> Good idea Robert.
>>>>>> >>
>>>>>> >> Regards
>>>>>> >> JB
>>>>>> >>
>>>>>> >> On 05/11/2018 16:52, Robert Bradshaw wrote:
>>>>>> >> > I think we'll want to allow upgrades across SDK versions. A
>>>>>> runner
>>>>>> >> > should be able to recognize when a coder (or any other aspect of
>>>>>> the
>>>>>> >> > pipeline) has changed and adapt/reject accordingly. (Until we
>>>>>> remove
>>>>>> >> > coders from sources/sinks, there's also possibly the expectation
>>>>>> that
>>>>>> >> > one should be able to read data from a source written with that
>>>>>> same
>>>>>> >> > coder across versions as well.)
>>>>>> >> >
>>>>>> >> > I think it really comes down to how coders are named. If we
>>>>>> decide to
>>>>>> >> > let coders change arbitrarily between versions, probably the URN
>>>>>> for
>>>>>> >> > SerializedJavaCoder should have the SDK version number in it.
>>>>>> Coders
>>>>>> >> > that are stable across SDKs can have better, more stable URNs
>>>>>> defined
>>>>>> >> > and registered.
>>>>>> >> >
>>>>>> >> > I am more OK with changing the registry to infer different
>>>>>> coders as
>>>>>> >> > the SDK evolves (which would be detected and manually
>>>>>> overwritten with
>>>>>> >> > the old ones, on a case-by-case basis, if they still exist). This
>>>>>> >> > should still be done with caution as it will make upgrading
>>>>>> harder.
>>>>>> >> > Highly composite, experimental coders should possibly be
>>>>>> designed in
>>>>>> >> > an intrinsically extensible way.
>>>>>> >> >
>>>>>> >> > On Mon, Nov 5, 2018 at 4:24 PM Jean-Baptiste Onofré <
>>>>>> jb@nanthrax.net> wrote:
>>>>>> >> >>
>>>>>> >> >> That's really a pita. It's an important and impacting change.
>>>>>> >> >>
>>>>>> >> >> I would go to 1.
>>>>>> >> >>
>>>>>> >> >> For LTS, as already said, I would create a LTS branch and only
>>>>>> cherry
>>>>>> >> >> pick some changes. Using master as LTS release branch won't
>>>>>> work IMHO.
>>>>>> >> >>
>>>>>> >> >> Regards
>>>>>> >> >> JB
>>>>>> >> >>
>>>>>> >> >> On 05/11/2018 15:47, Ismaël Mejía wrote:
>>>>>> >> >>> For some extra context, this change touches more than FileIO; in
>>>>>> >> >>> reality it will affect updates in any file-based pipeline, because
>>>>>> >> >>> the metadata on each file will now have an extra field for
>>>>>> >> >>> lastModifiedMillis.
>>>>>> >> >>>
>>>>>> >> >>> The PR looks perfect; the only issue is the backwards-compatibility
>>>>>> >> >>> Coder question. Knowing that Dataflow is probably the only runner
>>>>>> >> >>> affected, I would like to know what we can do.
>>>>>> >> >>>
>>>>>> >> >>> [1] Should we merge and have Coder updatability tied to SDK versions
>>>>>> >> >>> (which makes sense and is probably more aligned with the LTS
>>>>>> >> >>> discussion)?
>>>>>> >> >>> [2] Should we have a MetadataCoderV2? (Does this imply a repeated
>>>>>> >> >>> Metadata object?) In this case, where is the right place to identify
>>>>>> >> >>> and decide which coder to use?
>>>>>> >> >>>
>>>>>> >> >>> Other ideas... ?
>>>>>> >> >>>
>>>>>> >> >>> Last thing: the link that Luke shared does not seem to work (it looks
>>>>>> >> >>> like a googley-friendly URL); here is the full URL for those
>>>>>> >> >>> interested in the drain/update proposal:
>>>>>> >> >>>
>>>>>> >> >>> [2]
>>>>>> https://docs.google.com/document/d/1UWhnYPgui0gUYOsuGcCjLuoOUlGA4QaY91n8p3wz9MY/edit#
>>>>>> >> >>> On Fri, Nov 2, 2018 at 10:11 PM Lukasz Cwik <lc...@google.com>
>>>>>> wrote:
>>>>>> >> >>>>
>>>>>> >> >>>> I think the idea is that you would use one coder for paths
>>>>>> where you don't need this information and would have FileIO provide a
>>>>>> separate path that uses your updated coder.
>>>>>> >> >>>> Existing users would not be impacted and users of the new
>>>>>> FileIO that depend on this information would not be able to have updated
>>>>>> their pipeline in the first place.
>>>>>> >> >>>>
>>>>>> >> >>>> If the feature in FileIO is experimental, we could choose to
>>>>>> break it for existing users though since I don't know how feasible my
>>>>>> suggestion above is.
>>>>>> >> >>>>
>>>>>> >> >>>>
>>>>>> >> >>>>
>>>>>> >> >>>> On Fri, Nov 2, 2018 at 12:56 PM Jeff Klukas <
>>>>>> jklukas@mozilla.com> wrote:
>>>>>> >> >>>>>
>>>>>> >> >>>>> Lukasz - Thanks for those links. That's very helpful context.
>>>>>> >> >>>>>
>>>>>> >> >>>>> It sounds like there's no explicit user contract about
>>>>>> evolving Coder classes in the Java SDK and users might reasonably assume
>>>>>> Coders to be stable between SDK versions. Thus, users of the Dataflow or
>>>>>> Flink runners might reasonably expect that they can update the Java SDK
>>>>>> version used in their pipeline when performing an update.
>>>>>> >> >>>>>
>>>>>> >> >>>>> Based on that understanding, evolving a class like Metadata
>>>>>> might not be possible except in a major version bump where it's obvious to
>>>>>> users to expect breaking changes and not to expect an "update" operation to
>>>>>> work.
>>>>>> >> >>>>>
>>>>>> >> >>>>> It's not clear to me what changing the "name" of a coder
>>>>>> would look like or whether that's a tenable solution here. Would that
>>>>>> change be able to happen within the SDK itself, or is it something users
>>>>>> would need to specify?
>>>>>> >> >>
>>>>>> >> >> --
>>>>>> >> >> Jean-Baptiste Onofré
>>>>>> >> >> jbonofre@apache.org
>>>>>> >> >> http://blog.nanthrax.net
>>>>>> >> >> Talend - http://www.talend.com
>>>>>> >>
>>>>>> >> --
>>>>>> >> Jean-Baptiste Onofré
>>>>>> >> jbonofre@apache.org
>>>>>> >> http://blog.nanthrax.net
>>>>>> >> Talend - http://www.talend.com
>>>>>>
>>>>>

Re: Evolving a Coder for an added field

Posted by Lukasz Cwik <lc...@google.com>.
Reuven was one of the people I reached out to on this matter and he replied
on this thread.

Re: Evolving a Coder for an added field

Posted by Robert Bradshaw <ro...@google.com>.
Modifying an existing coder is a non-starter until we have a versioning
story. Creating an entirely new coder should definitely be possible, and
using it either opt-in or, if a good enough case can be made, possibly even
opt-out could get this unblocked.
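
For what the opt-in path could look like on the user side, a minimal sketch
(assuming a MetadataCoderV2 with an of() factory, as discussed earlier in the
thread):

    // Opt in for the whole pipeline: make V2 the coder inferred for Metadata.
    PipelineOptions options = PipelineOptionsFactory.create();
    Pipeline p = Pipeline.create(options);
    p.getCoderRegistry().registerCoderForClass(MatchResult.Metadata.class, MetadataCoderV2.of());

    // Or opt in for a single collection.
    PCollection<MatchResult.Metadata> matches =
        p.apply(FileIO.match().filepattern("gs://bucket/input/*"));
    matches.setCoder(MetadataCoderV2.of());

Either way the SDK-wide default stays MetadataCoder, so pipelines that never
opt in are untouched.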

Re: Evolving a Coder for an added field

Posted by Jeff Klukas <jk...@mozilla.com>.
Lukasz - Were you able to get any more context on the possibility of
versioning coders from other folks at Google?

It sounds like adding versioning for coders and/or schemas is potentially a
large change. At this point, should I just write up some highlights from
this thread in a JIRA issue for future tracking?

Re: Evolving a Coder for an added field

Posted by Reuven Lax <re...@google.com>.
A few thoughts:

1. I agree with you about coder versioning. The lack of a good story around
versioning has been a huge pain here, and it's unfortunate that nobody ever
worked on this.

2. I think versioning schemas will be easier than versioning coders
(especially for adding new fields). In many cases I suggest we start
looking at migrating as much as possible to schemas, and in Beam 3.0 maybe
we can migrate all of our internal payloads to schemas. Schemas support
nested fields, repeated fields, and map fields - which can model most things.

3. There was a Beam proposal for a way to generically handle incompatible
schema updates via snapshots. The idea was that such updates can be
accompanied by a transform that maps a pipeline snapshot into a new
snapshot with the encodings modified.

Reuven
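
As a rough illustration of why schema evolution is friendlier to added fields
(field names here are taken from this thread, not from any real schema for
Metadata), a new field can be introduced as nullable so that data written
before it existed still fits the updated schema:

    // Using org.apache.beam.sdk.schemas.Schema and org.apache.beam.sdk.values.Row.
    Schema v1 = Schema.builder()
        .addStringField("resourceId")
        .addInt64Field("sizeBytes")
        .build();

    Schema v2 = Schema.builder()
        .addStringField("resourceId")
        .addInt64Field("sizeBytes")
        .addNullableField("lastModifiedMillis", Schema.FieldType.INT64)
        .build();

    // A value written under v1 is representable under v2 with a null for the
    // missing field; no custom decode logic is needed.
    Row migrated = Row.withSchema(v2)
        .addValues("gs://bucket/input/file.txt", 1024L, null)
        .build();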

Re: Evolving a Coder for an added field

Posted by Lukasz Cwik <lc...@google.com>.
Jeff, I reached out to a few people at Google who were involved in the
Coder update issue in the past (being able to update streaming pipelines)
to see if they can work with you or at least provide context.

Re: Evolving a Coder for an added field

Posted by Jeff Klukas <jk...@mozilla.com>.
Conversation here has fizzled, but it sounds like there's basically a
consensus on the need for a new concept of Coder versioning that's
accessible at the Java level in order to allow an evolution path. Further,
it sounds like my open PR [0] for adding a new field to Metadata is
essentially blocked until we have coder versioning in place.

Is there any existing documentation of these concepts, or should I go ahead
and file a new Jira issue summarizing the problem? I don't think I have a
comprehensive enough understanding of the Coder machinery to be able to
design a solution, so I'd need to hand this off or simply leave it in the
Jira backlog.

[0] https://github.com/apache/beam/pull/6914


> >> jbonofre@apache.org
> >> http://blog.nanthrax.net
> >> Talend - http://www.talend.com
>

Re: Evolving a Coder for an added field

Posted by Robert Bradshaw <ro...@google.com>.
Yes, a Coder author should be able to register a URN with a mapping
from (components + payload) -> Coder (and vice versa), and this should
be more lightweight than manually editing the proto files.
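
For concreteness, a registration hook along those lines might look roughly
like the following. Everything here is hypothetical (neither the interface
nor its names exist in Beam today); it only illustrates mapping a URN plus
(components + payload) to a Coder and back:

import java.util.List;
import org.apache.beam.sdk.coders.Coder;

// Hypothetical sketch only -- not an existing Beam interface.
public interface CoderUrnRegistry {

  /** Associates a stable URN with functions that materialize and dematerialize a coder. */
  <CoderT extends Coder<?>> void registerUrn(
      String urn, Materializer<CoderT> materializer, Dematerializer<CoderT> dematerializer);

  /** Rebuilds a coder instance from its component coders and opaque payload. */
  interface Materializer<CoderT extends Coder<?>> {
    CoderT fromComponents(List<Coder<?>> components, byte[] payload);
  }

  /** Extracts the component coders and payload that, with the URN, identify a coder. */
  interface Dematerializer<CoderT extends Coder<?>> {
    List<? extends Coder<?>> getComponents(CoderT coder);

    byte[] getPayload(CoderT coder);
  }
}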
On Mon, Nov 5, 2018 at 7:12 PM Thomas Weise <th...@apache.org> wrote:
>
> +1
>
> I think that coders should be immutable/versioned. The SDK should know about all the available versions and be able to associate the data (stream or at rest) with the corresponding coder version via URN. We can also look how that is solved elsewhere, for example the Kafka schema registry.
>
> Today we only have a few URNs for standard coders: https://github.com/apache/beam/blob/master/model/pipeline/src/main/proto/beam_runner_api.proto#L617
>
> I imagine we will need a coder registry where IOs and users can add their versioned coders also?
>
> Thanks,
> Thomas
>
>
> On Mon, Nov 5, 2018 at 7:54 AM Jean-Baptiste Onofré <jb...@nanthrax.net> wrote:
>>
>> It makes sense to have a more concrete URN including the version.
>>
>> Good idea Robert.
>>
>> Regards
>> JB
>>
>> On 05/11/2018 16:52, Robert Bradshaw wrote:
>> > I think we'll want to allow upgrades across SDK versions. A runner
>> > should be able to recognize when a coder (or any other aspect of the
>> > pipeline) has changed and adapt/reject accordingly. (Until we remove
>> > coders from sources/sinks, there's also possibly the expectation that
>> > one should be able to read data from a source written with that same
>> > coder across versions as well.)
>> >
>> > I think it really comes down to how coders are named. If we decide to
>> > let coders change arbitrarily between versions, probably the URN for
>> > SerializedJavaCoder should have the SDK version number in it. Coders
>> > that are stable across SDKs can have better, more stable URNs defined
>> > and registered.
>> >
>> > I am more OK with changing the registry to infer different coders as
>> > the SDK evolves (which would be detected and manually overwritten with
>> > the old ones, on a case-by-case basis, if they still exist). This
>> > should still be done with caution as it will make upgrading harder.
>> > Highly composite, experimental coders should possibly be designed in
>> > an intrinsically extensible way.
>> >
>> > On Mon, Nov 5, 2018 at 4:24 PM Jean-Baptiste Onofré <jb...@nanthrax.net> wrote:
>> >>
>> >> That's really a pita. It's an important and impacting change.
>> >>
>> >> I would go to 1.
>> >>
>> >> For LTS, as already said, I would create a LTS branch and only cherry
>> >> pick some changes. Using master as LTS release branch won't work IMHO.
>> >>
>> >> Regards
>> >> JB
>> >>
>> >> On 05/11/2018 15:47, Ismaël Mejía wrote:
>> >>> For some extra context this change touches more than FileIO, in
>> >>> reality this will affect updates in any file-based pipelines because
>> >>> the metadata on each file will have now an extra field for the
>> >>> lastModifiedDate.
>> >>>
>> >>> The PR looks perfect, only issue is the backwards compatibility Coder
>> >>> question. Knowing that probably Dataflow is the only one affected, I
>> >>> would like to know what can we do?
>> >>>
>> >>> [1] Should we merge and the Coder updatability be tied to SDK versions
>> >>> (which makes sense and is probably more aligned with the LTS
>> >>> discussion)?
>> >>> [2] Should we have a MetadataCoderV2? (does this imply a repeated
>> >>> Matadata object) ? In this case where is the right place to identify
>> >>> and decide what coder to use?
>> >>>
>> >>> Other ideas... ?
>> >>>
>> >>> Last thing, the link that Luke shared does not seem to work (looks
>> >>> like a googley-friendly URL, here it is the full URL for those
>> >>> interested in the drain/update proposal:
>> >>>
>> >>> [2] https://docs.google.com/document/d/1UWhnYPgui0gUYOsuGcCjLuoOUlGA4QaY91n8p3wz9MY/edit#
>> >>> On Fri, Nov 2, 2018 at 10:11 PM Lukasz Cwik <lc...@google.com> wrote:
>> >>>>
>> >>>> I think the idea is that you would use one coder for paths where you don't need this information and would have FileIO provide a separate path that uses your updated coder.
>> >>>> Existing users would not be impacted and users of the new FileIO that depend on this information would not be able to have updated their pipeline in the first place.
>> >>>>
>> >>>> If the feature in FileIO is experimental, we could choose to break it for existing users though since I don't know how feasible my suggestion above is.
>> >>>>
>> >>>>
>> >>>>
>> >>>> On Fri, Nov 2, 2018 at 12:56 PM Jeff Klukas <jk...@mozilla.com> wrote:
>> >>>>>
>> >>>>> Lukasz - Thanks for those links. That's very helpful context.
>> >>>>>
>> >>>>> It sounds like there's no explicit user contract about evolving Coder classes in the Java SDK and users might reasonably assume Coders to be stable between SDK versions. Thus, users of the Dataflow or Flink runners might reasonably expect that they can update the Java SDK version used in their pipeline when performing an update.
>> >>>>>
>> >>>>> Based in that understanding, evolving a class like Metadata might not be possible except in a major version bump where it's obvious to users to expect breaking changes and not to expect an "update" operation to work.
>> >>>>>
>> >>>>> It's not clear to me what changing the "name" of a coder would look like or whether that's a tenable solution here. Would that change be able to happen within the SDK itself, or is it something users would need to specify?
>> >>
>> >> --
>> >> Jean-Baptiste Onofré
>> >> jbonofre@apache.org
>> >> http://blog.nanthrax.net
>> >> Talend - http://www.talend.com
>>
>> --
>> Jean-Baptiste Onofré
>> jbonofre@apache.org
>> http://blog.nanthrax.net
>> Talend - http://www.talend.com

Re: Evolving a Coder for an added field

Posted by Thomas Weise <th...@apache.org>.
+1

I think that coders should be immutable/versioned. The SDK should know
about all the available versions and be able to associate the data (stream
or at rest) with the corresponding coder version via URN. We can also look
at how that is solved elsewhere, for example the Kafka schema registry.

Today we only have a few URNs for standard coders:
https://github.com/apache/beam/blob/master/model/pipeline/src/main/proto/beam_runner_api.proto#L617

I imagine we will need a coder registry where IOs and users can add their
versioned coders also?
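
For illustration, such a registry could be as simple as a map from URN to a
specific coder version (the class and URN names below are hypothetical; only
MetadataCoder exists today, and MetadataCoderV2 is the variant being
discussed in this thread):

import java.util.HashMap;
import java.util.Map;
import org.apache.beam.sdk.coders.Coder;

// Hypothetical sketch of a versioned coder registry that resolves data at rest
// to the exact coder version that wrote it, similar to a Kafka schema registry lookup.
public final class VersionedCoderRegistry {
  private final Map<String, Coder<?>> codersByUrn = new HashMap<>();

  public void register(String urn, Coder<?> coder) {
    codersByUrn.put(urn, coder);
  }

  public Coder<?> resolve(String urn) {
    Coder<?> coder = codersByUrn.get(urn);
    if (coder == null) {
      throw new IllegalArgumentException("No coder registered for URN " + urn);
    }
    return coder;
  }
}

// Possible usage, with both historical versions kept available:
//   registry.register("beam:coder:filesystem_metadata:v1", MetadataCoder.of());
//   registry.register("beam:coder:filesystem_metadata:v2", MetadataCoderV2.of());
//   Coder<?> coderForOldState = registry.resolve("beam:coder:filesystem_metadata:v1");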

Thanks,
Thomas


On Mon, Nov 5, 2018 at 7:54 AM Jean-Baptiste Onofré <jb...@nanthrax.net> wrote:

> It makes sense to have a more concrete URN including the version.
>
> Good idea Robert.
>
> Regards
> JB
>
> On 05/11/2018 16:52, Robert Bradshaw wrote:
> > I think we'll want to allow upgrades across SDK versions. A runner
> > should be able to recognize when a coder (or any other aspect of the
> > pipeline) has changed and adapt/reject accordingly. (Until we remove
> > coders from sources/sinks, there's also possibly the expectation that
> > one should be able to read data from a source written with that same
> > coder across versions as well.)
> >
> > I think it really comes down to how coders are named. If we decide to
> > let coders change arbitrarily between versions, probably the URN for
> > SerializedJavaCoder should have the SDK version number in it. Coders
> > that are stable across SDKs can have better, more stable URNs defined
> > and registered.
> >
> > I am more OK with changing the registry to infer different coders as
> > the SDK evolves (which would be detected and manually overwritten with
> > the old ones, on a case-by-case basis, if they still exist). This
> > should still be done with caution as it will make upgrading harder.
> > Highly composite, experimental coders should possibly be designed in
> > an intrinsically extensible way.
> >
> > On Mon, Nov 5, 2018 at 4:24 PM Jean-Baptiste Onofré <jb...@nanthrax.net>
> wrote:
> >>
> >> That's really a pita. It's an important and impacting change.
> >>
> >> I would go to 1.
> >>
> >> For LTS, as already said, I would create a LTS branch and only cherry
> >> pick some changes. Using master as LTS release branch won't work IMHO.
> >>
> >> Regards
> >> JB
> >>
> >> On 05/11/2018 15:47, Ismaël Mejía wrote:
> >>> For some extra context this change touches more than FileIO, in
> >>> reality this will affect updates in any file-based pipelines because
> >>> the metadata on each file will have now an extra field for the
> >>> lastModifiedDate.
> >>>
> >>> The PR looks perfect, only issue is the backwards compatibility Coder
> >>> question. Knowing that probably Dataflow is the only one affected, I
> >>> would like to know what can we do?
> >>>
> >>> [1] Should we merge and the Coder updatability be tied to SDK versions
> >>> (which makes sense and is probably more aligned with the LTS
> >>> discussion)?
> >>> [2] Should we have a MetadataCoderV2? (does this imply a repeated
> >>> Matadata object) ? In this case where is the right place to identify
> >>> and decide what coder to use?
> >>>
> >>> Other ideas... ?
> >>>
> >>> Last thing, the link that Luke shared does not seem to work (looks
> >>> like a googley-friendly URL, here it is the full URL for those
> >>> interested in the drain/update proposal:
> >>>
> >>> [2]
> https://docs.google.com/document/d/1UWhnYPgui0gUYOsuGcCjLuoOUlGA4QaY91n8p3wz9MY/edit#
> >>> On Fri, Nov 2, 2018 at 10:11 PM Lukasz Cwik <lc...@google.com> wrote:
> >>>>
> >>>> I think the idea is that you would use one coder for paths where you
> don't need this information and would have FileIO provide a separate path
> that uses your updated coder.
> >>>> Existing users would not be impacted and users of the new FileIO that
> depend on this information would not be able to have updated their pipeline
> in the first place.
> >>>>
> >>>> If the feature in FileIO is experimental, we could choose to break it
> for existing users though since I don't know how feasible my suggestion
> above is.
> >>>>
> >>>>
> >>>>
> >>>> On Fri, Nov 2, 2018 at 12:56 PM Jeff Klukas <jk...@mozilla.com>
> wrote:
> >>>>>
> >>>>> Lukasz - Thanks for those links. That's very helpful context.
> >>>>>
> >>>>> It sounds like there's no explicit user contract about evolving
> Coder classes in the Java SDK and users might reasonably assume Coders to
> be stable between SDK versions. Thus, users of the Dataflow or Flink
> runners might reasonably expect that they can update the Java SDK version
> used in their pipeline when performing an update.
> >>>>>
> >>>>> Based in that understanding, evolving a class like Metadata might
> not be possible except in a major version bump where it's obvious to users
> to expect breaking changes and not to expect an "update" operation to work.
> >>>>>
> >>>>> It's not clear to me what changing the "name" of a coder would look
> like or whether that's a tenable solution here. Would that change be able
> to happen within the SDK itself, or is it something users would need to
> specify?
> >>
> >> --
> >> Jean-Baptiste Onofré
> >> jbonofre@apache.org
> >> http://blog.nanthrax.net
> >> Talend - http://www.talend.com
>
> --
> Jean-Baptiste Onofré
> jbonofre@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>

Re: Evolving a Coder for an added field

Posted by Jean-Baptiste Onofré <jb...@nanthrax.net>.
It makes sense to have a more concrete URN including the version.

Good idea Robert.

Regards
JB

On 05/11/2018 16:52, Robert Bradshaw wrote:
> I think we'll want to allow upgrades across SDK versions. A runner
> should be able to recognize when a coder (or any other aspect of the
> pipeline) has changed and adapt/reject accordingly. (Until we remove
> coders from sources/sinks, there's also possibly the expectation that
> one should be able to read data from a source written with that same
> coder across versions as well.)
> 
> I think it really comes down to how coders are named. If we decide to
> let coders change arbitrarily between versions, probably the URN for
> SerializedJavaCoder should have the SDK version number in it. Coders
> that are stable across SDKs can have better, more stable URNs defined
> and registered.
> 
> I am more OK with changing the registry to infer different coders as
> the SDK evolves (which would be detected and manually overwritten with
> the old ones, on a case-by-case basis, if they still exist). This
> should still be done with caution as it will make upgrading harder.
> Highly composite, experimental coders should possibly be designed in
> an intrinsically extensible way.
> 
> On Mon, Nov 5, 2018 at 4:24 PM Jean-Baptiste Onofré <jb...@nanthrax.net> wrote:
>>
>> That's really a pita. It's an important and impacting change.
>>
>> I would go to 1.
>>
>> For LTS, as already said, I would create a LTS branch and only cherry
>> pick some changes. Using master as LTS release branch won't work IMHO.
>>
>> Regards
>> JB
>>
>> On 05/11/2018 15:47, Ismaël Mejía wrote:
>>> For some extra context this change touches more than FileIO, in
>>> reality this will affect updates in any file-based pipelines because
>>> the metadata on each file will have now an extra field for the
>>> lastModifiedDate.
>>>
>>> The PR looks perfect, only issue is the backwards compatibility Coder
>>> question. Knowing that probably Dataflow is the only one affected, I
>>> would like to know what can we do?
>>>
>>> [1] Should we merge and the Coder updatability be tied to SDK versions
>>> (which makes sense and is probably more aligned with the LTS
>>> discussion)?
>>> [2] Should we have a MetadataCoderV2? (does this imply a repeated
>>> Matadata object) ? In this case where is the right place to identify
>>> and decide what coder to use?
>>>
>>> Other ideas... ?
>>>
>>> Last thing, the link that Luke shared does not seem to work (looks
>>> like a googley-friendly URL, here it is the full URL for those
>>> interested in the drain/update proposal:
>>>
>>> [2] https://docs.google.com/document/d/1UWhnYPgui0gUYOsuGcCjLuoOUlGA4QaY91n8p3wz9MY/edit#
>>> On Fri, Nov 2, 2018 at 10:11 PM Lukasz Cwik <lc...@google.com> wrote:
>>>>
>>>> I think the idea is that you would use one coder for paths where you don't need this information and would have FileIO provide a separate path that uses your updated coder.
>>>> Existing users would not be impacted and users of the new FileIO that depend on this information would not be able to have updated their pipeline in the first place.
>>>>
>>>> If the feature in FileIO is experimental, we could choose to break it for existing users though since I don't know how feasible my suggestion above is.
>>>>
>>>>
>>>>
>>>> On Fri, Nov 2, 2018 at 12:56 PM Jeff Klukas <jk...@mozilla.com> wrote:
>>>>>
>>>>> Lukasz - Thanks for those links. That's very helpful context.
>>>>>
>>>>> It sounds like there's no explicit user contract about evolving Coder classes in the Java SDK and users might reasonably assume Coders to be stable between SDK versions. Thus, users of the Dataflow or Flink runners might reasonably expect that they can update the Java SDK version used in their pipeline when performing an update.
>>>>>
>>>>> Based in that understanding, evolving a class like Metadata might not be possible except in a major version bump where it's obvious to users to expect breaking changes and not to expect an "update" operation to work.
>>>>>
>>>>> It's not clear to me what changing the "name" of a coder would look like or whether that's a tenable solution here. Would that change be able to happen within the SDK itself, or is it something users would need to specify?
>>
>> --
>> Jean-Baptiste Onofré
>> jbonofre@apache.org
>> http://blog.nanthrax.net
>> Talend - http://www.talend.com

-- 
Jean-Baptiste Onofré
jbonofre@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com

Re: Evolving a Coder for an added field

Posted by Robert Bradshaw <ro...@google.com>.
I think we'll want to allow upgrades across SDK versions. A runner
should be able to recognize when a coder (or any other aspect of the
pipeline) has changed and adapt/reject accordingly. (Until we remove
coders from sources/sinks, there's also possibly the expectation that
one should be able to read data from a source written with that same
coder across versions as well.)

I think it really comes down to how coders are named. If we decide to
let coders change arbitrarily between versions, probably the URN for
SerializedJavaCoder should have the SDK version number in it. Coders
that are stable across SDKs can have better, more stable URNs defined
and registered.
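
To make the naming idea concrete, the distinction might look like this (the
URN strings are purely illustrative, not URNs Beam actually defines):

// A coder whose wire format may change with every release carries the SDK version...
public static final String JAVA_SERIALIZED_CODER_URN = "beam:coder:java_serialized:2.9.0";
// ...while a coder with a pinned, documented format gets a stable URN that only
// changes when the format itself does.
public static final String FILESYSTEM_METADATA_CODER_URN = "beam:coder:filesystem_metadata:v1";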

I am more OK with changing the registry to infer different coders as
the SDK evolves (which would be detected and manually overwritten with
the old ones, on a case-by-case basis, if they still exist). This
should still be done with caution as it will make upgrading harder.
Highly composite, experimental coders should possibly be designed in
an intrinsically extensible way.

On Mon, Nov 5, 2018 at 4:24 PM Jean-Baptiste Onofré <jb...@nanthrax.net> wrote:
>
> That's really a pita. It's an important and impacting change.
>
> I would go to 1.
>
> For LTS, as already said, I would create a LTS branch and only cherry
> pick some changes. Using master as LTS release branch won't work IMHO.
>
> Regards
> JB
>
> On 05/11/2018 15:47, Ismaël Mejía wrote:
> > For some extra context this change touches more than FileIO, in
> > reality this will affect updates in any file-based pipelines because
> > the metadata on each file will have now an extra field for the
> > lastModifiedDate.
> >
> > The PR looks perfect, only issue is the backwards compatibility Coder
> > question. Knowing that probably Dataflow is the only one affected, I
> > would like to know what can we do?
> >
> > [1] Should we merge and the Coder updatability be tied to SDK versions
> > (which makes sense and is probably more aligned with the LTS
> > discussion)?
> > [2] Should we have a MetadataCoderV2? (does this imply a repeated
> > Matadata object) ? In this case where is the right place to identify
> > and decide what coder to use?
> >
> > Other ideas... ?
> >
> > Last thing, the link that Luke shared does not seem to work (looks
> > like a googley-friendly URL, here it is the full URL for those
> > interested in the drain/update proposal:
> >
> > [2] https://docs.google.com/document/d/1UWhnYPgui0gUYOsuGcCjLuoOUlGA4QaY91n8p3wz9MY/edit#
> > On Fri, Nov 2, 2018 at 10:11 PM Lukasz Cwik <lc...@google.com> wrote:
> >>
> >> I think the idea is that you would use one coder for paths where you don't need this information and would have FileIO provide a separate path that uses your updated coder.
> >> Existing users would not be impacted and users of the new FileIO that depend on this information would not be able to have updated their pipeline in the first place.
> >>
> >> If the feature in FileIO is experimental, we could choose to break it for existing users though since I don't know how feasible my suggestion above is.
> >>
> >>
> >>
> >> On Fri, Nov 2, 2018 at 12:56 PM Jeff Klukas <jk...@mozilla.com> wrote:
> >>>
> >>> Lukasz - Thanks for those links. That's very helpful context.
> >>>
> >>> It sounds like there's no explicit user contract about evolving Coder classes in the Java SDK and users might reasonably assume Coders to be stable between SDK versions. Thus, users of the Dataflow or Flink runners might reasonably expect that they can update the Java SDK version used in their pipeline when performing an update.
> >>>
> >>> Based in that understanding, evolving a class like Metadata might not be possible except in a major version bump where it's obvious to users to expect breaking changes and not to expect an "update" operation to work.
> >>>
> >>> It's not clear to me what changing the "name" of a coder would look like or whether that's a tenable solution here. Would that change be able to happen within the SDK itself, or is it something users would need to specify?
>
> --
> Jean-Baptiste Onofré
> jbonofre@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com

Re: Evolving a Coder for an added field

Posted by Jean-Baptiste Onofré <jb...@nanthrax.net>.
That's really a pita. It's an important and impacting change.

I would go with option [1].

For LTS, as already said, I would create an LTS branch and only cherry-pick
some changes. Using master as the LTS release branch won't work IMHO.

Regards
JB

On 05/11/2018 15:47, Ismaël Mejía wrote:
> For some extra context this change touches more than FileIO, in
> reality this will affect updates in any file-based pipelines because
> the metadata on each file will have now an extra field for the
> lastModifiedDate.
> 
> The PR looks perfect, only issue is the backwards compatibility Coder
> question. Knowing that probably Dataflow is the only one affected, I
> would like to know what can we do?
> 
> [1] Should we merge and the Coder updatability be tied to SDK versions
> (which makes sense and is probably more aligned with the LTS
> discussion)?
> [2] Should we have a MetadataCoderV2? (does this imply a repeated
> Matadata object) ? In this case where is the right place to identify
> and decide what coder to use?
> 
> Other ideas... ?
> 
> Last thing, the link that Luke shared does not seem to work (looks
> like a googley-friendly URL, here it is the full URL for those
> interested in the drain/update proposal:
> 
> [2] https://docs.google.com/document/d/1UWhnYPgui0gUYOsuGcCjLuoOUlGA4QaY91n8p3wz9MY/edit#
> On Fri, Nov 2, 2018 at 10:11 PM Lukasz Cwik <lc...@google.com> wrote:
>>
>> I think the idea is that you would use one coder for paths where you don't need this information and would have FileIO provide a separate path that uses your updated coder.
>> Existing users would not be impacted and users of the new FileIO that depend on this information would not be able to have updated their pipeline in the first place.
>>
>> If the feature in FileIO is experimental, we could choose to break it for existing users though since I don't know how feasible my suggestion above is.
>>
>>
>>
>> On Fri, Nov 2, 2018 at 12:56 PM Jeff Klukas <jk...@mozilla.com> wrote:
>>>
>>> Lukasz - Thanks for those links. That's very helpful context.
>>>
>>> It sounds like there's no explicit user contract about evolving Coder classes in the Java SDK and users might reasonably assume Coders to be stable between SDK versions. Thus, users of the Dataflow or Flink runners might reasonably expect that they can update the Java SDK version used in their pipeline when performing an update.
>>>
>>> Based in that understanding, evolving a class like Metadata might not be possible except in a major version bump where it's obvious to users to expect breaking changes and not to expect an "update" operation to work.
>>>
>>> It's not clear to me what changing the "name" of a coder would look like or whether that's a tenable solution here. Would that change be able to happen within the SDK itself, or is it something users would need to specify?

-- 
Jean-Baptiste Onofré
jbonofre@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com

Re: Evolving a Coder for an added field

Posted by Ismaël Mejía <ie...@gmail.com>.
For some extra context, this change touches more than FileIO; in
reality it will affect updates in any file-based pipeline, because
the metadata for each file will now have an extra field for the
lastModifiedDate.

The PR looks perfect; the only issue is the backwards-compatibility Coder
question. Knowing that Dataflow is probably the only runner affected, I
would like to know: what can we do?

[1] Should we merge and tie Coder updatability to SDK versions
(which makes sense and is probably more aligned with the LTS
discussion)?
[2] Should we have a MetadataCoderV2 (does this imply a repeated
Metadata object)? In that case, where is the right place to identify
and decide which coder to use? (A rough sketch is included below.)

Other ideas... ?
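
To make option [2] concrete, here is a rough sketch. It is not the actual
implementation in the PR; it assumes the existing fields keep the encoding
MetadataCoder uses today and that Metadata gains a lastModifiedMillis
getter plus a matching builder setter:

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import org.apache.beam.sdk.coders.AtomicCoder;
import org.apache.beam.sdk.coders.VarIntCoder;
import org.apache.beam.sdk.coders.VarLongCoder;
import org.apache.beam.sdk.io.fs.MatchResult.Metadata;
import org.apache.beam.sdk.io.fs.ResourceId;
import org.apache.beam.sdk.io.fs.ResourceIdCoder;

public class MetadataCoderV2 extends AtomicCoder<Metadata> {
  private static final ResourceIdCoder RESOURCE_ID_CODER = ResourceIdCoder.of();
  private static final VarIntCoder INT_CODER = VarIntCoder.of();
  private static final VarLongCoder LONG_CODER = VarLongCoder.of();

  public static MetadataCoderV2 of() {
    return new MetadataCoderV2();
  }

  @Override
  public void encode(Metadata value, OutputStream os) throws IOException {
    // Same layout as the existing MetadataCoder...
    RESOURCE_ID_CODER.encode(value.resourceId(), os);
    INT_CODER.encode(value.isReadSeekEfficient() ? 1 : 0, os);
    LONG_CODER.encode(value.sizeBytes(), os);
    // ...with the new field appended at the end.
    LONG_CODER.encode(value.lastModifiedMillis(), os);
  }

  @Override
  public Metadata decode(InputStream is) throws IOException {
    ResourceId resourceId = RESOURCE_ID_CODER.decode(is);
    boolean isReadSeekEfficient = INT_CODER.decode(is) == 1;
    long sizeBytes = LONG_CODER.decode(is);
    long lastModifiedMillis = LONG_CODER.decode(is);
    return Metadata.builder()
        .setResourceId(resourceId)
        .setIsReadSeekEfficient(isReadSeekEfficient)
        .setSizeBytes(sizeBytes)
        .setLastModifiedMillis(lastModifiedMillis)
        .build();
  }
}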

Last thing: the link that Luke shared does not seem to work (it looks
like a Google-internal URL); here is the full URL for those
interested in the drain/update proposal:

[2] https://docs.google.com/document/d/1UWhnYPgui0gUYOsuGcCjLuoOUlGA4QaY91n8p3wz9MY/edit#
On Fri, Nov 2, 2018 at 10:11 PM Lukasz Cwik <lc...@google.com> wrote:
>
> I think the idea is that you would use one coder for paths where you don't need this information and would have FileIO provide a separate path that uses your updated coder.
> Existing users would not be impacted and users of the new FileIO that depend on this information would not be able to have updated their pipeline in the first place.
>
> If the feature in FileIO is experimental, we could choose to break it for existing users though since I don't know how feasible my suggestion above is.
>
>
>
> On Fri, Nov 2, 2018 at 12:56 PM Jeff Klukas <jk...@mozilla.com> wrote:
>>
>> Lukasz - Thanks for those links. That's very helpful context.
>>
>> It sounds like there's no explicit user contract about evolving Coder classes in the Java SDK and users might reasonably assume Coders to be stable between SDK versions. Thus, users of the Dataflow or Flink runners might reasonably expect that they can update the Java SDK version used in their pipeline when performing an update.
>>
>> Based in that understanding, evolving a class like Metadata might not be possible except in a major version bump where it's obvious to users to expect breaking changes and not to expect an "update" operation to work.
>>
>> It's not clear to me what changing the "name" of a coder would look like or whether that's a tenable solution here. Would that change be able to happen within the SDK itself, or is it something users would need to specify?

Re: Evolving a Coder for an added field

Posted by Lukasz Cwik <lc...@google.com>.
I think the idea is that you would keep one coder for paths where you don't
need this information and have FileIO provide a separate path that
uses your updated coder.
Existing users would not be impacted, and users of the new FileIO path that
depend on this information could not have an existing pipeline to update
in the first place.
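
Roughly what I have in mind (the transform names and MetadataCoderV2 are
hypothetical here; FileIO.match() is the existing API, and which knob
actually opts in to the new coder would be a design decision):

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.io.fs.MatchResult;
import org.apache.beam.sdk.values.PCollection;

public class MetadataCoderOptIn {
  public static void main(String[] args) {
    Pipeline pipeline = Pipeline.create();

    // Existing path: keeps the registered MetadataCoder, so running pipelines see no change.
    PCollection<MatchResult.Metadata> legacyMatches =
        pipeline.apply("MatchOld", FileIO.match().filepattern("gs://bucket/input/*"));

    // New opt-in path: explicitly uses the coder that carries the new field, so only
    // pipelines that ask for the information pick up the new encoding.
    PCollection<MatchResult.Metadata> newMatches =
        pipeline
            .apply("MatchNew", FileIO.match().filepattern("gs://bucket/input/*"))
            .setCoder(MetadataCoderV2.of());

    pipeline.run();
  }
}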

If the feature in FileIO is experimental, though, we could choose to break
it for existing users, since I don't know how feasible my suggestion above
is.



On Fri, Nov 2, 2018 at 12:56 PM Jeff Klukas <jk...@mozilla.com> wrote:

> Lukasz - Thanks for those links. That's very helpful context.
>
> It sounds like there's no explicit user contract about evolving Coder
> classes in the Java SDK and users might reasonably assume Coders to be
> stable between SDK versions. Thus, users of the Dataflow or Flink runners
> might reasonably expect that they can update the Java SDK version used in
> their pipeline when performing an update.
>
> Based in that understanding, evolving a class like Metadata might not be
> possible except in a major version bump where it's obvious to users to
> expect breaking changes and not to expect an "update" operation to work.
>
> It's not clear to me what changing the "name" of a coder would look like
> or whether that's a tenable solution here. Would that change be able to
> happen within the SDK itself, or is it something users would need to
> specify?
>

Re: Evolving a Coder for an added field

Posted by Jeff Klukas <jk...@mozilla.com>.
Lukasz - Thanks for those links. That's very helpful context.

It sounds like there's no explicit user contract about evolving Coder
classes in the Java SDK and users might reasonably assume Coders to be
stable between SDK versions. Thus, users of the Dataflow or Flink runners
might reasonably expect that they can update the Java SDK version used in
their pipeline when performing an update.

Based on that understanding, evolving a class like Metadata might not be
possible except in a major version bump where it's obvious to users to
expect breaking changes and not to expect an "update" operation to work.

It's not clear to me what changing the "name" of a coder would look like or
whether that's a tenable solution here. Would that change be able to happen
within the SDK itself, or is it something users would need to specify?

Re: Evolving a Coder for an added field

Posted by Lukasz Cwik <lc...@google.com>.
Can you give more insight into how the Flink update works because I was
only aware of the Dataflow one?


On Fri, Nov 2, 2018 at 12:35 PM Reuven Lax <re...@google.com> wrote:

> This is not quite true - people often update Flink pipelines as well.
>
> On Fri, Nov 2, 2018 at 10:50 AM Lukasz Cwik <lc...@google.com> wrote:
>
>> +Reuven Lax <re...@google.com> for update proposal
>>
>> Dataflow is the only Apache Beam runner which has the capability of
>> updating pipelines. This page[1] describes many of the aspects of how it
>> works and specifically talks about coder changes:
>>
>>
>>    - *Changing the Coder for a step.* When you update a job, the Cloud
>>    Dataflow service preserves any data records currently buffered (for
>>    example, while windowing
>>    <https://beam.apache.org/documentation/programming-guide/#windowing> is
>>    resolving) and handles them in the replacement job. If the replacement job
>>    uses different or incompatible data encoding
>>    <https://beam.apache.org/documentation/programming-guide/#data-encoding-and-type-safety>,
>>    the Cloud Dataflow service will not be able to serialize or deserialize
>>    these records.
>>
>>    *Caution:* The Cloud Dataflow service currently cannot guarantee that
>>    changing a coder in your prior pipeline to an incompatible coder will cause
>>    the compatibility check to fail. It is recommended that you *do not* attempt
>>    to make backwards-incompatible changes to Coders when updating your
>>    pipeline; if your pipeline update succeeds but you encounter issues or
>>    errors in the resulting data, ensure that your replacement pipeline uses
>>    data encoding that's the same as, or at least compatible with, your prior
>>    job.
>>
>> There has been a proposal[2] for general update support within Apache
>> Beam with little traction for implementation outside of Dataflow.
>>
>> Looking at your code, it wouldn't work with update because encoded values
>> concatenated together without an element delimiter in many situations.
>> Hence when you decode a value using the past format with your new coder you
>> would read from the next value corrupting your read. If you really need to
>> change the encoding in a backwards incompatible way, you would need to
>> change the "name" of the coder which currently defaults to the class name.
>>
>> 1: https://cloud.google.com/dataflow/pipelines/updating-a-pipeline
>> 2: http://doc/1UWhnYPgui0gUYOsuGcCjLuoOUlGA4QaY91n8p3wz9MY
>>
>>
>> On Fri, Nov 2, 2018 at 5:44 AM Jeff Klukas <jk...@mozilla.com> wrote:
>>
>>> I'm adding a new lastModifiedMillis field to MatchResult.Metadata [0]
>>> which requires also updating MetadataCoder, but it's not clear to me
>>> whether there are guidelines to follow when evolving a type when that
>>> changes the encoding.
>>>
>>> Is a user allowed to update Beam library versions as part of updating a
>>> pipeline? If so, there could be a situation where an updated pipeline is
>>> reading state that includes Metadata encoded without the new
>>> lastModifiedMillis field, which would cause a CodingException to be thrown.
>>>
>>> Is there prior art for evolving a type and its Coder? Should I be
>>> defensive and catch CodingException when attempting to decode the new
>>> field, providing a default value?
>>>
>>> [0] https://github.com/apache/beam/pull/6914
>>>
>>

Re: Evolving a Coder for an added field

Posted by Reuven Lax <re...@google.com>.
This is not quite true - people often update Flink pipelines as well.

On Fri, Nov 2, 2018 at 10:50 AM Lukasz Cwik <lc...@google.com> wrote:

> +Reuven Lax <re...@google.com> for update proposal
>
> Dataflow is the only Apache Beam runner which has the capability of
> updating pipelines. This page[1] describes many of the aspects of how it
> works and specifically talks about coder changes:
>
>
>    - *Changing the Coder for a step.* When you update a job, the Cloud
>    Dataflow service preserves any data records currently buffered (for
>    example, while windowing
>    <https://beam.apache.org/documentation/programming-guide/#windowing> is
>    resolving) and handles them in the replacement job. If the replacement job
>    uses different or incompatible data encoding
>    <https://beam.apache.org/documentation/programming-guide/#data-encoding-and-type-safety>,
>    the Cloud Dataflow service will not be able to serialize or deserialize
>    these records.
>
>    *Caution:* The Cloud Dataflow service currently cannot guarantee that
>    changing a coder in your prior pipeline to an incompatible coder will cause
>    the compatibility check to fail. It is recommended that you *do not* attempt
>    to make backwards-incompatible changes to Coders when updating your
>    pipeline; if your pipeline update succeeds but you encounter issues or
>    errors in the resulting data, ensure that your replacement pipeline uses
>    data encoding that's the same as, or at least compatible with, your prior
>    job.
>
> There has been a proposal[2] for general update support within Apache Beam
> with little traction for implementation outside of Dataflow.
>
> Looking at your code, it wouldn't work with update because encoded values
> concatenated together without an element delimiter in many situations.
> Hence when you decode a value using the past format with your new coder you
> would read from the next value corrupting your read. If you really need to
> change the encoding in a backwards incompatible way, you would need to
> change the "name" of the coder which currently defaults to the class name.
>
> 1: https://cloud.google.com/dataflow/pipelines/updating-a-pipeline
> 2: http://doc/1UWhnYPgui0gUYOsuGcCjLuoOUlGA4QaY91n8p3wz9MY
>
>
> On Fri, Nov 2, 2018 at 5:44 AM Jeff Klukas <jk...@mozilla.com> wrote:
>
>> I'm adding a new lastModifiedMillis field to MatchResult.Metadata [0]
>> which requires also updating MetadataCoder, but it's not clear to me
>> whether there are guidelines to follow when evolving a type when that
>> changes the encoding.
>>
>> Is a user allowed to update Beam library versions as part of updating a
>> pipeline? If so, there could be a situation where an updated pipeline is
>> reading state that includes Metadata encoded without the new
>> lastModifiedMillis field, which would cause a CodingException to be thrown.
>>
>> Is there prior art for evolving a type and its Coder? Should I be
>> defensive and catch CodingException when attempting to decode the new
>> field, providing a default value?
>>
>> [0] https://github.com/apache/beam/pull/6914
>>
>

Re: Evolving a Coder for an added field

Posted by Lukasz Cwik <lc...@google.com>.
+Reuven Lax <re...@google.com> for update proposal

Dataflow is the only Apache Beam runner which has the capability of
updating pipelines. This page[1] describes many of the aspects of how it
works and specifically talks about coder changes:


   - *Changing the Coder for a step.* When you update a job, the Cloud
   Dataflow service preserves any data records currently buffered (for
   example, while windowing
   <https://beam.apache.org/documentation/programming-guide/#windowing> is
   resolving) and handles them in the replacement job. If the replacement job
   uses different or incompatible data encoding
   <https://beam.apache.org/documentation/programming-guide/#data-encoding-and-type-safety>,
   the Cloud Dataflow service will not be able to serialize or deserialize
   these records.

   *Caution:* The Cloud Dataflow service currently cannot guarantee that
   changing a coder in your prior pipeline to an incompatible coder will cause
   the compatibility check to fail. It is recommended that you *do not* attempt
   to make backwards-incompatible changes to Coders when updating your
   pipeline; if your pipeline update succeeds but you encounter issues or
   errors in the resulting data, ensure that your replacement pipeline uses
   data encoding that's the same as, or at least compatible with, your prior
   job.

There has been a proposal[2] for general update support within Apache Beam
with little traction for implementation outside of Dataflow.

Looking at your code, it wouldn't work with update because encoded values
are concatenated together without an element delimiter in many situations.
Hence, when you decode a value written in the old format with your new coder,
you would read into the next value, corrupting your read. If you really need
to change the encoding in a backwards-incompatible way, you would need to
change the "name" of the coder, which currently defaults to the class name.

1: https://cloud.google.com/dataflow/pipelines/updating-a-pipeline
2: http://doc/1UWhnYPgui0gUYOsuGcCjLuoOUlGA4QaY91n8p3wz9MY
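
To make that failure mode concrete, here is a minimal, runnable sketch. It
uses plain VarLongCoder in place of the real Metadata fields, but the
mechanics are the same: two "v1" records are written back to back with no
delimiter, and a "v2" reader that expects an extra long per record silently
consumes the start of the second record instead of failing fast:

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import org.apache.beam.sdk.coders.VarLongCoder;

public class ConcatenatedDecodeDemo {
  public static void main(String[] args) throws IOException {
    VarLongCoder longCoder = VarLongCoder.of();

    // "v1" encoding: each record is a single varint-encoded long, concatenated directly.
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    longCoder.encode(1024L, out);  // record 1: sizeBytes
    longCoder.encode(2048L, out);  // record 2: sizeBytes

    // "v2" decoding: expects (sizeBytes, lastModifiedMillis) per record.
    ByteArrayInputStream in = new ByteArrayInputStream(out.toByteArray());
    long sizeBytes = longCoder.decode(in);           // reads 1024, as intended
    long lastModifiedMillis = longCoder.decode(in);  // reads 2048 -- the next record's data

    // Prints sizeBytes=1024 lastModifiedMillis=2048: no exception, just a corrupted read.
    System.out.println("sizeBytes=" + sizeBytes + " lastModifiedMillis=" + lastModifiedMillis);
  }
}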


On Fri, Nov 2, 2018 at 5:44 AM Jeff Klukas <jk...@mozilla.com> wrote:

> I'm adding a new lastModifiedMillis field to MatchResult.Metadata [0]
> which requires also updating MetadataCoder, but it's not clear to me
> whether there are guidelines to follow when evolving a type when that
> changes the encoding.
>
> Is a user allowed to update Beam library versions as part of updating a
> pipeline? If so, there could be a situation where an updated pipeline is
> reading state that includes Metadata encoded without the new
> lastModifiedMillis field, which would cause a CodingException to be thrown.
>
> Is there prior art for evolving a type and its Coder? Should I be
> defensive and catch CodingException when attempting to decode the new
> field, providing a default value?
>
> [0] https://github.com/apache/beam/pull/6914
>