You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@beam.apache.org by Heejong Lee <he...@google.com> on 2019/04/03 23:55:24 UTC

[DISCUSS] change the encoding scheme of Python StrUtf8Coder

Hi all,

It looks like UTF-8 String Coder in Java and Python SDKs uses different
encoding schemes. StringUtf8Coder in Java SDK puts the varint length of the
input string before actual data bytes however StrUtf8Coder in Python SDK
directly encodes the input string to bytes value. For the last few weeks,
I've been testing and fixing cross-language IO transforms and this
discrepancy is a major blocker for me. IMO, we should unify the encoding
schemes of UTF8 strings across the different SDKs and make it a standard
coder. Any thoughts?

Thanks,

Re: [DISCUSS] change the encoding scheme of Python StrUtf8Coder

Posted by Kenneth Knowles <ke...@apache.org>.

On Thu, Apr 4, 2019 at 1:48 PM Kenneth Knowles <ke...@apache.org> wrote:

> I have to actually say that a collection of test cases is not a definition
> of a format. It is one of the pieces, and the other one is a textual
> description in a prominent, discoverable place.
>

A reference implementation can also serve the purpose of the textual
description, possibly well. But the point is that examples are not a spec.

Kenn


>
> Kenn
>
> On Thu, Apr 4, 2019 at 1:28 PM Lukasz Cwik <lc...@google.com> wrote:
>
>>
>>
>> On Thu, Apr 4, 2019 at 1:15 PM Chamikara Jayalath <ch...@google.com>
>> wrote:
>>
>>>
>>>
>>> On Thu, Apr 4, 2019 at 12:15 PM Lukasz Cwik <lc...@google.com> wrote:
>>>
>>>> standard_coders.yaml[1] is where we are currently defining these
>>>> formats.
>>>> Unfortunately the Python SDK has its own copy[2].
>>>>
>>>
>>> Ah great. Thanks for the pointer. Any idea why there's  a separate copy
>>> for Python ? I didn't see a significant difference in definitions looking
>>> at few random coders there but I might have missed something. If there's no
>>> reason to maintain two, we should probably unify.
>>> Also, seems like we haven't added the definition for UTF-8 coder yet.
>>>
>>>
>> Not certain as well. I did notice the timer coder definition didn't exist
>> in the Python copy.
>>
>>
>>>
>>>> Here is an example PR[3] that adds the "beam:coder:double:v1" as tests
>>>> to the Java and Python SDKs to ensure interoperability.
>>>>
>>>> Robert Burke, does the Go SDK have a test where it uses
>>>> standard_coders.yaml and runs compatibility tests?
>>>>
>>>> Chamikara, creating new coder classes is a pain since the type -> coder
>>>> mapping per SDK language would select the non-well known type if we added a
>>>> new one to a language. If we swapped the default type->coder mapping, this
>>>> would still break update for pipelines forcing users to update their code
>>>> to select the non-well known type. If we don't change the default
>>>> type->coder mapping, the well known coder will gain little usage. I think
>>>> we should fix the Python coder to use the same encoding as Java for UTF-8
>>>> strings before there are too many Python SDK users.
>>>>
>>>
>>> I was thinking that may be we should just change the default UTF-8 coder
>>> for Fn API path which is experimental. Updating Python to do what's done
>>> for Java is fine if we agree that encoding used for Java should be the
>>> standard.
>>>
>>>
>> That is a good idea to use the Fn API experiment to control which gets
>> selected.
>>
>>
>>>
>>>> 1:
>>>> https://github.com/apache/beam/blob/master/model/fn-execution/src/main/resources/org/apache/beam/model/fnexecution/v1/standard_coders.yaml
>>>> 2:
>>>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/testing/data/standard_coders.yaml
>>>> 3: https://github.com/apache/beam/pull/8205
>>>>
>>>> On Thu, Apr 4, 2019 at 11:50 AM Chamikara Jayalath <
>>>> chamikara@google.com> wrote:
>>>>
>>>>>
>>>>>
>>>>> On Thu, Apr 4, 2019 at 11:29 AM Robert Bradshaw <ro...@google.com>
>>>>> wrote:
>>>>>
>>>>>> A URN defines the encoding.
>>>>>>
>>>>>> There are (unfortunately) *two* encodings defined for a Coder (defined
>>>>>> by a URN), the nested and the unnested one. IIRC, in both Java and
>>>>>> Python, the nested one prefixes with a var-int length, and the
>>>>>> unnested one does not.
>>>>>>
>>>>>
>>>>> Could you clarify where we define the exact encoding ? I only see a
>>>>> URN for UTF-8 [1] while if you look at the implementations Java includes
>>>>> length in the encoding [1] while Python [1] does not.
>>>>>
>>>>> [1]
>>>>> https://github.com/apache/beam/blob/069fc3de95bd96f34c363308ad9ba988ab58502d/model/pipeline/src/main/proto/beam_runner_api.proto#L563
>>>>> [2]
>>>>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/coders/StringUtf8Coder.java#L50
>>>>> [3]
>>>>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/coders/coders.py#L321
>>>>>
>>>>>
>>>>>
>>>>>>
>>>>>> We should define the spec clearly and have cross-language tests.
>>>>>>
>>>>>
>>>>> +1
>>>>>
>>>>> Regarding backwards compatibility, I agree that we should probably not
>>>>> update existing coder classes. Probably we should just standardize the
>>>>> correct encoding (may be as a comment near corresponding URN in the
>>>>> beam_runner_api.proto ?) and create new coder classes as needed.
>>>>>
>>>>>
>>>>>>
>>>>>> On Thu, Apr 4, 2019 at 8:13 PM Pablo Estrada <pa...@google.com>
>>>>>> wrote:
>>>>>> >
>>>>>> > Could this be a backwards-incompatible change that would break
>>>>>> pipelines from upgrading? If they have data in-flight in between operators,
>>>>>> and we change the coder, they would break?
>>>>>> > I know very little about coders, but since nobody has mentioned it,
>>>>>> I wanted to make sure we have it in mind.
>>>>>> > -P.
>>>>>> >
>>>>>> > On Wed, Apr 3, 2019 at 8:33 PM Kenneth Knowles <ke...@apache.org>
>>>>>> wrote:
>>>>>> >>
>>>>>> >> Agree that a coder URN defines the encoding. I see that string
>>>>>> UTF-8 was added to the proto enum, but it needs a written spec of the
>>>>>> encoding. Ideally some test data that different languages can use to drive
>>>>>> compliance testing.
>>>>>> >>
>>>>>> >> Kenn
>>>>>> >>
>>>>>> >> On Wed, Apr 3, 2019 at 6:21 PM Robert Burke <ro...@frantil.com>
>>>>>> wrote:
>>>>>> >>>
>>>>>> >>> String UTF8 was recently added as a "standard coder " URN in the
>>>>>> protos, but I don't think that developed beyond Java, so adding it to
>>>>>> Python would be reasonable in my opinion.
>>>>>> >>>
>>>>>> >>> The Go SDK handles Strings as "custom coders" presently which for
>>>>>> Go are always length prefixed (and reported to the Runner as
>>>>>> LP+CustomCoder). It would be straight forward to add the correct handling
>>>>>> for strings, as Go natively treats strings as UTF8.
>>>>>> >>>
>>>>>> >>>
>>>>>> >>> On Wed, Apr 3, 2019, 5:03 PM Heejong Lee <he...@google.com>
>>>>>> wrote:
>>>>>> >>>>
>>>>>> >>>> Hi all,
>>>>>> >>>>
>>>>>> >>>> It looks like UTF-8 String Coder in Java and Python SDKs uses
>>>>>> different encoding schemes. StringUtf8Coder in Java SDK puts the varint
>>>>>> length of the input string before actual data bytes however StrUtf8Coder in
>>>>>> Python SDK directly encodes the input string to bytes value. For the last
>>>>>> few weeks, I've been testing and fixing cross-language IO transforms and
>>>>>> this discrepancy is a major blocker for me. IMO, we should unify the
>>>>>> encoding schemes of UTF8 strings across the different SDKs and make it a
>>>>>> standard coder. Any thoughts?
>>>>>> >>>>
>>>>>> >>>> Thanks,
>>>>>>
>>>>>

Re: [DISCUSS] change the encoding scheme of Python StrUtf8Coder

Posted by Kenneth Knowles <ke...@apache.org>.

I have to actually say that a collection of test cases is not a definition
of a format. It is one of the pieces, and the other one is a textual
description in a prominent, discoverable place.

Kenn

On Thu, Apr 4, 2019 at 1:28 PM Lukasz Cwik <lc...@google.com> wrote:

>
>
> On Thu, Apr 4, 2019 at 1:15 PM Chamikara Jayalath <ch...@google.com>
> wrote:
>
>>
>>
>> On Thu, Apr 4, 2019 at 12:15 PM Lukasz Cwik <lc...@google.com> wrote:
>>
>>> standard_coders.yaml[1] is where we are currently defining these formats.
>>> Unfortunately the Python SDK has its own copy[2].
>>>
>>
>> Ah great. Thanks for the pointer. Any idea why there's  a separate copy
>> for Python ? I didn't see a significant difference in definitions looking
>> at few random coders there but I might have missed something. If there's no
>> reason to maintain two, we should probably unify.
>> Also, seems like we haven't added the definition for UTF-8 coder yet.
>>
>>
> Not certain as well. I did notice the timer coder definition didn't exist
> in the Python copy.
>
>
>>
>>> Here is an example PR[3] that adds the "beam:coder:double:v1" as tests
>>> to the Java and Python SDKs to ensure interoperability.
>>>
>>> Robert Burke, does the Go SDK have a test where it uses
>>> standard_coders.yaml and runs compatibility tests?
>>>
>>> Chamikara, creating new coder classes is a pain since the type -> coder
>>> mapping per SDK language would select the non-well known type if we added a
>>> new one to a language. If we swapped the default type->coder mapping, this
>>> would still break update for pipelines forcing users to update their code
>>> to select the non-well known type. If we don't change the default
>>> type->coder mapping, the well known coder will gain little usage. I think
>>> we should fix the Python coder to use the same encoding as Java for UTF-8
>>> strings before there are too many Python SDK users.
>>>
>>
>> I was thinking that may be we should just change the default UTF-8 coder
>> for Fn API path which is experimental. Updating Python to do what's done
>> for Java is fine if we agree that encoding used for Java should be the
>> standard.
>>
>>
> That is a good idea to use the Fn API experiment to control which gets
> selected.
>
>
>>
>>> 1:
>>> https://github.com/apache/beam/blob/master/model/fn-execution/src/main/resources/org/apache/beam/model/fnexecution/v1/standard_coders.yaml
>>> 2:
>>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/testing/data/standard_coders.yaml
>>> 3: https://github.com/apache/beam/pull/8205
>>>
>>> On Thu, Apr 4, 2019 at 11:50 AM Chamikara Jayalath <ch...@google.com>
>>> wrote:
>>>
>>>>
>>>>
>>>> On Thu, Apr 4, 2019 at 11:29 AM Robert Bradshaw <ro...@google.com>
>>>> wrote:
>>>>
>>>>> A URN defines the encoding.
>>>>>
>>>>> There are (unfortunately) *two* encodings defined for a Coder (defined
>>>>> by a URN), the nested and the unnested one. IIRC, in both Java and
>>>>> Python, the nested one prefixes with a var-int length, and the
>>>>> unnested one does not.
>>>>>
>>>>
>>>> Could you clarify where we define the exact encoding ? I only see a URN
>>>> for UTF-8 [1] while if you look at the implementations Java includes length
>>>> in the encoding [1] while Python [1] does not.
>>>>
>>>> [1]
>>>> https://github.com/apache/beam/blob/069fc3de95bd96f34c363308ad9ba988ab58502d/model/pipeline/src/main/proto/beam_runner_api.proto#L563
>>>> [2]
>>>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/coders/StringUtf8Coder.java#L50
>>>> [3]
>>>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/coders/coders.py#L321
>>>>
>>>>
>>>>
>>>>>
>>>>> We should define the spec clearly and have cross-language tests.
>>>>>
>>>>
>>>> +1
>>>>
>>>> Regarding backwards compatibility, I agree that we should probably not
>>>> update existing coder classes. Probably we should just standardize the
>>>> correct encoding (may be as a comment near corresponding URN in the
>>>> beam_runner_api.proto ?) and create new coder classes as needed.
>>>>
>>>>
>>>>>
>>>>> On Thu, Apr 4, 2019 at 8:13 PM Pablo Estrada <pa...@google.com>
>>>>> wrote:
>>>>> >
>>>>> > Could this be a backwards-incompatible change that would break
>>>>> pipelines from upgrading? If they have data in-flight in between operators,
>>>>> and we change the coder, they would break?
>>>>> > I know very little about coders, but since nobody has mentioned it,
>>>>> I wanted to make sure we have it in mind.
>>>>> > -P.
>>>>> >
>>>>> > On Wed, Apr 3, 2019 at 8:33 PM Kenneth Knowles <ke...@apache.org>
>>>>> wrote:
>>>>> >>
>>>>> >> Agree that a coder URN defines the encoding. I see that string
>>>>> UTF-8 was added to the proto enum, but it needs a written spec of the
>>>>> encoding. Ideally some test data that different languages can use to drive
>>>>> compliance testing.
>>>>> >>
>>>>> >> Kenn
>>>>> >>
>>>>> >> On Wed, Apr 3, 2019 at 6:21 PM Robert Burke <ro...@frantil.com>
>>>>> wrote:
>>>>> >>>
>>>>> >>> String UTF8 was recently added as a "standard coder " URN in the
>>>>> protos, but I don't think that developed beyond Java, so adding it to
>>>>> Python would be reasonable in my opinion.
>>>>> >>>
>>>>> >>> The Go SDK handles Strings as "custom coders" presently which for
>>>>> Go are always length prefixed (and reported to the Runner as
>>>>> LP+CustomCoder). It would be straight forward to add the correct handling
>>>>> for strings, as Go natively treats strings as UTF8.
>>>>> >>>
>>>>> >>>
>>>>> >>> On Wed, Apr 3, 2019, 5:03 PM Heejong Lee <he...@google.com>
>>>>> wrote:
>>>>> >>>>
>>>>> >>>> Hi all,
>>>>> >>>>
>>>>> >>>> It looks like UTF-8 String Coder in Java and Python SDKs uses
>>>>> different encoding schemes. StringUtf8Coder in Java SDK puts the varint
>>>>> length of the input string before actual data bytes however StrUtf8Coder in
>>>>> Python SDK directly encodes the input string to bytes value. For the last
>>>>> few weeks, I've been testing and fixing cross-language IO transforms and
>>>>> this discrepancy is a major blocker for me. IMO, we should unify the
>>>>> encoding schemes of UTF8 strings across the different SDKs and make it a
>>>>> standard coder. Any thoughts?
>>>>> >>>>
>>>>> >>>> Thanks,
>>>>>
>>>>

Re: [DISCUSS] change the encoding scheme of Python StrUtf8Coder

Posted by Robert Burke <ro...@frantil.com>.

On Mon, 8 Apr 2019 at 16:03, Robert Bradshaw <ro...@google.com> wrote:

> On Mon, Apr 8, 2019 at 8:04 PM Kenneth Knowles <ke...@apache.org> wrote:
> >
> > On Mon, Apr 8, 2019 at 1:57 AM Robert Bradshaw <ro...@google.com>
> wrote:
> >>
> >> On Sat, Apr 6, 2019 at 12:08 AM Kenneth Knowles <ke...@apache.org>
> wrote:
> >> >
> >> > On Fri, Apr 5, 2019 at 2:24 PM Robert Bradshaw <ro...@google.com>
> wrote:
> >> >>
> >> >> On Fri, Apr 5, 2019 at 6:24 PM Kenneth Knowles <ke...@apache.org>
> wrote:
> >> >> >
> >> >> > Nested and unnested contexts are two different encodings. Can we
> just give them different URNs? We can even just express the length-prefixed
> UTF-8 as a composition of the length-prefix URN and the UTF-8 URN.
> >> >>
> >> >> It's not that simple, especially when it comes to composite
> encodings.
> >> >> E.g. for some coders, nested(C) == unnested(C), for some coders
> >> >> nested(C) == lenth_prefix(unnested(C)), and for other coders it's
> >> >> something else altogether (e.g. when creating a kv coder, the first
> >> >> component must use nested context, and the second inherits the nested
> >> >> vs. unnested context). When creating TupleCoder(A, B) one doesn't
> want
> >> >> to forcibly use LenthPrefixCoder(A) and LengthPrefixCoder(B), nor
> does
> >> >> one want to force LengthPrefixCoder(TupleCoder(A, B)) because A and B
> >> >> may themselves be large and incrementally written (e.g.
> >> >> IterableCoder). Using distinct URNs doesn't work well if the runner
> is
> >> >> free to compose and decompose tuple, iterable, etc. coders that it
> >> >> doesn't understand.
> >> >>
> >> >> Until we stop using Coders for IO (a worthy but probably lofty goal)
> >> >> we will continue to need the unnested context (lest we expect and
> >> >> produce length-prefixed coders in text files, as bigtable keys,
> etc.).
> >> >> On the other hand, almost all internal use is nested (due to sending
> >> >> elements around as part of element streams). The other place we cross
> >> >> over is LengthPrefixCoder that encodes its values using the unnested
> >> >> context prefixed by the unnested encoding length.
> >> >>
> >> >> Perhaps a step in the right direction would be to consistently use
> the
> >> >> unnested context everywhere but IOs (meaning when we talked about
> >> >> coders from the FnAPI perspective, they're *always* in the nested
> >> >> context, and hence always have the one and only encoding defined by
> >> >> that URN, including when wrapped by a length prefix coder (which
> would
> >> >> sometimes result in double length prefixing, but I think that's a
> >> >> price worth paying (or we could do something more clever like make
> >> >> length-prefix an (explicit) modifier on a coder rather than a new
> >> >> coder itself that would default to length prefixing (or some of the
> >> >> other delimiting schemes we've come up with) but allow the component
> >> >> coder to offer alternative length-prefix-compatible encodings))). IOs
> >> >> could be updated to take Parsers and Formatters (or whatever we call
> >> >> them) with the Coders in the constructors left as syntactic sugar
> >> >> until we could remove them in 3.0. As Luke says, we have a chance to
> >> >> fix our coders for portable pipelines now.
> >> >>
> >> >> In the (very) short term, we're stuck with a nested and unnested
> >> >> version of StringUtf8, just as we have for bytes, lest we change the
> >> >> meaning of (or disallow some of) TupleCoder[StrUtf8Coder, ...],
> >> >> LengthPrefixCoder[StrUtf8Coder], and using StringUtf8Coder for IO.
> >> >
> >> >
> >> > First, let's note that "nested" and "outer" are a misnomer. The
> distinction is whether it is the last thing encoded in the stream. In a
> KvCoder<KeyCoder, ValueCoder> the ValueCoder is actually encoded in the
> "outer" context though the value is nested. No doubt a good amount of
> confusion comes from the initial and continued use of this terminology.
> >>
> >> +1. I think of these as "self-delimiting" vs. "externally-delimited."
> >>
> >> > So, all that said, it is a simple fact that UTF-8 and length-prefixed
> UTF-8 are two different encodings. Encodings are the fundamental concept
> here and coders encapsulate two encodings, with some subtle and
> inconsistently-applied rules about when to use which encoding. I think we
> should still give them distinct URNs unless impossible. You've outlined
> some steps to clarify the situation.
> >>
> >> Currently, we have a one-to-two relationship between Coders and
> >> encodings, and a one-to-one relationship between URNs and encoders.
> >>
> >> To make the first one-to-one, we would either have to make
> >> StringUtf8Coder unsuitable for TextIO (letting it always prefix its
> >> contents with a length) or unsuitable for the key part of a KV, the
> >> element of an iterable, etc. (where the length is required).
> >
> > Definitely in favor of dropping TextIO from the conversation. Just need
> to think about compatibility, but I would save that for after the idealized
> end goal is defined, and hack it somehow. to the one-to-one vs one-to-two,
> can we have:
> >
> >  - SDK coder -> proto: be smart about knowing what the context is and
> uses the appropriate URN (this is when building the graph)
>
> So you'd have to day "give me the outer version of yourself" or "give
> me the inner version of yourself" which would (sometimes) get passed
> down to components. And this extra bit would also have to be set in
> the object -> proto caches. It's the same nested/outer logic, just put
> in a different place.
>
> The context here is almost always "inner" because even if it's the
> PCollection element's coder, it must be encoded in a self-delimiting
> way as per the (current) data-plane protocol. On the other hand, if
> you have a PCollection<KV> that's later used as a map side input, the
> K appears alone (potentiall outer) during lookups. Of course one can
> always use inner and not care about the redundancy.
>
> >  - proto -> SDK: instantiate a variant of the coder that is fixed to one
> context (this is when you get a process bundle instruction from the runner)
> >
> > I think this is more-or-less the idea you mean by...
> >
> >> Alternatively we could give Coders the ability to return the
> >> nested/unnested version of themselves, but this also gets messy
> >> because it depends on the ultimate outer context which we don't always
> >> have at hand (and leads to surprises, e.g. asking for the key coder of
> >> a KV coder may not return the same coder that was used to construct
> >> it).
>
> I was thinking of doing this at graph construction time, but at proto
> translation time could be better (though it complicates that).
>
> > ... and I don't think it would be too messy, because fixing the context
> would happen around submission time, yes? Basically, it keep "Java Coder"
> as a vestigial concept that compiles away, much like PValue/PInput/POutput.
>
> Unfortunately nested vs. unnested is already a model-level concept
> (e.g. Python does the same, and has been changed to match java
> exactly). Not that we couldn't change this for FnApi running.
>
> >> On the other hand, breaking the one-to-one relationship of Coders and
> >> URNs is also undesirable, because it forces a choice when serializing
> >> to protos (where one may not always know all the contexts it's used
> >> in)
> >
> > When would you not know the context? I posit that toProto should always
> know the context.
> >
> >> and makes the round-trip through protos non-idempotent (if the URN
> >> gets decoded back into a Coder that does not have this dual-defined
> >> encoding).
> >
> > This doesn't seem like such a problem. Of course, I would say that, as I
> am proposing to do just this :-)
>
> I think we'll have even more of this (happening even at pipeline
> construction time, e.g. for cross-language pipelines).
>
> >> Also, if a runner knows about special pairs of URNs that
> >> represent the nested/unnested encodings of the same type, and rules
> >> like knowing constraints on components of KVs, etc., this seems an
> >> even worse situation than we're already in.
> >
> > Perhaps even worse than this would be to enshrine the idea of
> self-delimiting/non-self-delimiting coders as a flag in the proto, so
> runners know to flip it. That is really bad, making the model inherit
> accidental complexity from the Java SDK. But is this flag always just a
> length prefix, in practice? If so, then the smarts to add to toProto would
> cause the self-delimiting variant of UTF-8 coder to be a composite
> LengthPrefix(UTF-8) coder. Then the model doesn't change and the runner
> does not need to be aware of special pairs of URNs.
>
> No, most of the "basic" coders (e.g. ints, doubles, iterables, ...)
> are already self-delimiting.
>
> >> > I think the most meaningful issue is runners manipulating coders. But
> the runner should be able to instruct the SDK (where the coder actually
> executes) about which encoding to use. I need to think through some
> examples of where the SDK tells the runner the encoding of something and
> those where the runner fabricates steps and instructs the SDK what encoding
> to use for elements. GroupByKey, GetKeys, etc, are examples where the
> context will change.
> >>
> >> Combiner lifting and side inputs are good candidates as well.
> >
> >
> > GBK example:
> >
> >  - Input: KvCoder[Outer]<KeyCoder[Inner], ValueCoder[Outer]>
> >  - Output: KvCoder[Outer]<KeyCoder[Inner],
> IterableCoder[Outer]<ValueCoder[Inner]>>
> >
> > In this case, it is the SDK providing both, so it can easily make the
> ValueCoder[Inner] a LengthPrefix(ValueCoder[Outer]). If the ValueCoder is
> already length-prefixed there is a cost. That could be detected and elided
> I think; might need additional methods but not if the ValueCoder toProto
> already encoded in a readable way.
> >
> > GBKVaiGBKO example, where the runner chooses a particular implementation
> strategy. Assuming the client language is not Java so inner/outer switching
> cannot be done unless explicitly represented in the proto.
> >
> >  - Input: KvCoder[Outer]<KeyCoder[Inner], ValueCoder[Outer]>
> >  - Intermediate: KvCoder[Outer]<KeyCoder[Inner],
> IterableCoder[Outer]<WindowedValueCoder[Outer]<ValueCoder[Inner]>>>
> >
> > Since the input ValueCoder is always Outer, it is easy to add a
> length-prefix without knowing what it is. So from these examples, I
> hypothesize that the hard case is when there is a user coder in Inner
> encoding and the runner wants to turn it into Outer encoding, but the Inner
> encoding does not explicitly indicate that it is a LengthPrefix(Outer)
> encoding. I believe that this is not a problem - the runner can just leave
> it as the Inner encoding since every Inner encoding is a safe Outer
> encoding (but not vice-versa). I don't know of a real example, but if there
> were a runner-inserted GetKeys then the KeyCoder[Inner] might *want* to
> change to KeyCoder[Outer] but that wouldn't be a requirement.
> >
> > For side inputs, PCollection<ValueCoder[Outer]> the runner may want to
> put a bunch of elements in some data structure that requires
> self-delimiting. Same logic, it is always easy to move from Outer to Inner.
>
> Moving an Iterable from Outer to Inner by adding a length-prefix is
> really expensive, but useless as Outer is already length-delimited.
> Moving a KV coder form Outer to Inner involves changing the second
> component. I don't think we want to bake special rules like this into
> the model.
>
>
> This email is already very long, but in summary I think the right
> answer is to just get rid of Outer altogether (except possibly for
> IOs, which we'd only preserve for legacy reasons until 3.0). The
> question remains whether we should allow a Coder C (as part of its
> spec, known to all that know the URN of C) to declare an alternative
> coder C' for LengthPrefix(C), that is different than e ->
> varLen(size(encode(C, e))) + encode(C, e), possibly (often?) C' == C,
> to avoid double-length-prefixing.
>

Technically this is how the Go SDK treats all "custom coders" which in this
case means, anything that's not defined by the beam spec.
Such coders are effectively C', the user side is provided an element to
encode, and is expected to produce bytes (C), which the system
automatically Length Prefixes, and indicates to the runner the bytes are
LengthPrefix(C).
On decode, when it sees LengthPrefix(C), it peels off the length prefix N,
and then provides the user coder with the length bytes N.

The real trick is that there's no dedicated LengthPrefix coder in the SDK,
it's simply a part of the CustomCoder implementation which wraps all user
coders. The way I see it, the LengthPrefix for user elements is for the
Runner's benefit, not the User's, and having the runner "own" that part of
the encoding avoids weird issues where it might get striped unintentionally

There's a more trusting approach where a user coder is handed the stream
directly (io.Reader in Go's instance), and manages it's own in memory byte
buffers, and reads from the stream, but, in the SDK's present state, there
hasn't been a requirement for that level of efficiency yet, in any of the
Go SDK profiling benchmarks I've looked at. I'm largely sending around
protocol buffers, rather than Ints, floats or other ~1-2 bytewords where a
LP represents higher overhead.

Robert Burke (the other Robert)

>
> - Robert
>
>
> >> >> > On Fri, Apr 5, 2019 at 12:38 AM Robert Bradshaw <
> robertwb@google.com> wrote:
> >> >> >>
> >> >> >> On Fri, Apr 5, 2019 at 12:50 AM Heejong Lee <he...@google.com>
> wrote:
> >> >> >> >
> >> >> >> > Robert, does nested/unnested context work properly for Java?
> >> >> >>
> >> >> >> I believe so. It is similar to the bytes coder, that prefixes vs.
> not
> >> >> >> based on the context.
> >> >> >>
> >> >> >> > I can see that the Context is fixed to NESTED[1] and the encode
> method with the Context parameter is marked as deprecated[2].
> >> >> >> >
> >> >> >> > [1]:
> https://github.com/apache/beam/blob/0868e7544fd1e96db67ff5b9e70a67802c0f0c8e/sdks/java/core/src/main/java/org/apache/beam/sdk/coders/StringUtf8Coder.java#L68
> >> >> >> > [2]:
> https://github.com/apache/beam/blob/0868e7544fd1e96db67ff5b9e70a67802c0f0c8e/sdks/java/core/src/main/java/org/apache/beam/sdk/coders/Coder.java#L132
> >> >> >>
> >> >> >> That doesn't mean it's unused, e.g.
> >> >> >>
> >> >> >>
> https://github.com/apache/beam/blob/release-2.12.0/sdks/java/core/src/main/java/org/apache/beam/sdk/util/CoderUtils.java#L160
> >> >> >>
> https://github.com/apache/beam/blob/release-2.12.0/sdks/java/core/src/main/java/org/apache/beam/sdk/coders/LengthPrefixCoder.java#L64
> >> >> >>
> >> >> >> (and I'm sure there's others).
> >> >> >>
> >> >> >> > On Thu, Apr 4, 2019 at 3:25 PM Robert Bradshaw <
> robertwb@google.com> wrote:
> >> >> >> >>
> >> >> >> >> I don't know why there are two separate copies of
> >> >> >> >> standard_coders.yaml--originally there was just one (though it
> did
> >> >> >> >> live in the Python directory). I'm guessing a copy was made
> rather
> >> >> >> >> than just pointing both to the new location, but that
> completely
> >> >> >> >> defeats the point. I can't seem to access JIRA right now; could
> >> >> >> >> someone file an issue to resolve this?
> >> >> >> >>
> >> >> >> >> I also think the spec should be next to the definition of the
> URN,
> >> >> >> >> that's one of the reason the URNs were originally in a
> markdown file
> >> >> >> >> (to encourage good documentation, literate programming style).
> Many
> >> >> >> >> coders already have their specs there.
> >> >> >> >>
> >> >> >> >> Regarding backwards compatibility, we can't change existing
> coders,
> >> >> >> >> and making new coders won't help with inference ('cause
> changing that
> >> >> >> >> would also be backwards incompatible). Fortunately, I think
> we're
> >> >> >> >> already doing the consistent thing here: In both Python and
> Java the
> >> >> >> >> raw UTF-8 encoded bytes are encoded when used in an *unnested*
> context
> >> >> >> >> and the length-prefixed UTF-8 encoded bytes are used when the
> coder is
> >> >> >> >> used in a *nested* context.
> >> >> >> >>
> >> >> >> >> I'd really like to see the whole nested/unnested context go
> away, but
> >> >> >> >> that'll probably require Beam 3.0; it causes way more
> confusion than
> >> >> >> >> the couple of bytes it saves in a couple of places.
> >> >> >> >>
> >> >> >> >> - Robert
> >> >> >> >>
> >> >> >> >> On Thu, Apr 4, 2019 at 10:55 PM Robert Burke <
> robert@frantil.com> wrote:
> >> >> >> >> >
> >> >> >> >> > My 2cents is that the "Textual description" should be part
> of the documentation of the URNs on the Proto messages, since that's the
> common place. I've added a short description for the varints for example,
> and we already have lenghthier format & protocol descriptions there for
> iterables and similar.
> >> >> >> >> >
> >> >> >> >> > The proto [1] *can be* the spec if we want it to be.
> >> >> >> >> >
> >> >> >> >> > [1]:
> https://github.com/apache/beam/blob/069fc3de95bd96f34c363308ad9ba988ab58502d/model/pipeline/src/main/proto/beam_runner_api.proto#L557
> >> >> >> >> >
> >> >> >> >> > On Thu, 4 Apr 2019 at 13:51, Kenneth Knowles <
> kenn@apache.org> wrote:
> >> >> >> >> >>
> >> >> >> >> >>
> >> >> >> >> >>
> >> >> >> >> >> On Thu, Apr 4, 2019 at 1:49 PM Robert Burke <
> robert@frantil.com> wrote:
> >> >> >> >> >>>
> >> >> >> >> >>> We should probably move the "java" version of the yaml
> file [1] to a common location rather than deep in the java hierarchy, or
> copying it for Go and Python, but that can be a separate task. It's
> probably non-trivial since it looks like it's part of a java resources
> structure.
> >> >> >> >> >>
> >> >> >> >> >>
> >> >> >> >> >> Seems like /model is a good place for this if we don't want
> to invent a new language-independent hierarchy.
> >> >> >> >> >>
> >> >> >> >> >> Kenn
> >> >> >> >> >>
> >> >> >> >> >>
> >> >> >> >> >>>
> >> >> >> >> >>> Luke, the Go SDK doesn't currently do this validation, but
> it shouldn't be difficult, given pointers to the Java and Python variants
> of the tests to crib from [2]. Care would need to be taken so that Beam Go
> SDK users (such as they are) aren't forced to run them, and not have the
> yaml file to read. I'd suggest putting it with the integration tests [3].
> >> >> >> >> >>>
> >> >> >> >> >>> I've filed a JIRA (BEAM-7009) for tracking this Go SDK
> side work. [4]
> >> >> >> >> >>>
> >> >> >> >> >>> 1:
> https://github.com/apache/beam/blob/master/model/fn-execution/src/main/resources/org/apache/beam/model/fnexecution/v1/standard_coders.yaml
> >> >> >> >> >>> 2:
> https://github.com/apache/beam/search?q=standard_coders.yaml&unscoped_q=standard_coders.yaml
> >> >> >> >> >>> 3: https://github.com/apache/beam/tree/master/sdks/go/test
> >> >> >> >> >>> 4: https://issues.apache.org/jira/browse/BEAM-7009
> >> >> >> >> >>>
> >> >> >> >> >>>
> >> >> >> >> >>> On Thu, 4 Apr 2019 at 13:28, Lukasz Cwik <lc...@google.com>
> wrote:
> >> >> >> >> >>>>
> >> >> >> >> >>>>
> >> >> >> >> >>>>
> >> >> >> >> >>>> On Thu, Apr 4, 2019 at 1:15 PM Chamikara Jayalath <
> chamikara@google.com> wrote:
> >> >> >> >> >>>>>
> >> >> >> >> >>>>>
> >> >> >> >> >>>>>
> >> >> >> >> >>>>> On Thu, Apr 4, 2019 at 12:15 PM Lukasz Cwik <
> lcwik@google.com> wrote:
> >> >> >> >> >>>>>>
> >> >> >> >> >>>>>> standard_coders.yaml[1] is where we are currently
> defining these formats.
> >> >> >> >> >>>>>> Unfortunately the Python SDK has its own copy[2].
> >> >> >> >> >>>>>
> >> >> >> >> >>>>>
> >> >> >> >> >>>>> Ah great. Thanks for the pointer. Any idea why there's
> a separate copy for Python ? I didn't see a significant difference in
> definitions looking at few random coders there but I might have missed
> something. If there's no reason to maintain two, we should probably unify.
> >> >> >> >> >>>>> Also, seems like we haven't added the definition for
> UTF-8 coder yet.
> >> >> >> >> >>>>>
> >> >> >> >> >>>>
> >> >> >> >> >>>> Not certain as well. I did notice the timer coder
> definition didn't exist in the Python copy.
> >> >> >> >> >>>>
> >> >> >> >> >>>>>>
> >> >> >> >> >>>>>>
> >> >> >> >> >>>>>> Here is an example PR[3] that adds the
> "beam:coder:double:v1" as tests to the Java and Python SDKs to ensure
> interoperability.
> >> >> >> >> >>>>>>
> >> >> >> >> >>>>>> Robert Burke, does the Go SDK have a test where it uses
> standard_coders.yaml and runs compatibility tests?
> >> >> >> >> >>>>>>
> >> >> >> >> >>>>>> Chamikara, creating new coder classes is a pain since
> the type -> coder mapping per SDK language would select the non-well known
> type if we added a new one to a language. If we swapped the default
> type->coder mapping, this would still break update for pipelines forcing
> users to update their code to select the non-well known type. If we don't
> change the default type->coder mapping, the well known coder will gain
> little usage. I think we should fix the Python coder to use the same
> encoding as Java for UTF-8 strings before there are too many Python SDK
> users.
> >> >> >> >> >>>>>
> >> >> >> >> >>>>>
> >> >> >> >> >>>>> I was thinking that may be we should just change the
> default UTF-8 coder for Fn API path which is experimental. Updating Python
> to do what's done for Java is fine if we agree that encoding used for Java
> should be the standard.
> >> >> >> >> >>>>>
> >> >> >> >> >>>>
> >> >> >> >> >>>> That is a good idea to use the Fn API experiment to
> control which gets selected.
> >> >> >> >> >>>>
> >> >> >> >> >>>>>>
> >> >> >> >> >>>>>>
> >> >> >> >> >>>>>> 1:
> https://github.com/apache/beam/blob/master/model/fn-execution/src/main/resources/org/apache/beam/model/fnexecution/v1/standard_coders.yaml
> >> >> >> >> >>>>>> 2:
> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/testing/data/standard_coders.yaml
> >> >> >> >> >>>>>> 3: https://github.com/apache/beam/pull/8205
> >> >> >> >> >>>>>>
> >> >> >> >> >>>>>> On Thu, Apr 4, 2019 at 11:50 AM Chamikara Jayalath <
> chamikara@google.com> wrote:
> >> >> >> >> >>>>>>>
> >> >> >> >> >>>>>>>
> >> >> >> >> >>>>>>>
> >> >> >> >> >>>>>>> On Thu, Apr 4, 2019 at 11:29 AM Robert Bradshaw <
> robertwb@google.com> wrote:
> >> >> >> >> >>>>>>>>
> >> >> >> >> >>>>>>>> A URN defines the encoding.
> >> >> >> >> >>>>>>>>
> >> >> >> >> >>>>>>>> There are (unfortunately) *two* encodings defined for
> a Coder (defined
> >> >> >> >> >>>>>>>> by a URN), the nested and the unnested one. IIRC, in
> both Java and
> >> >> >> >> >>>>>>>> Python, the nested one prefixes with a var-int
> length, and the
> >> >> >> >> >>>>>>>> unnested one does not.
> >> >> >> >> >>>>>>>
> >> >> >> >> >>>>>>>
> >> >> >> >> >>>>>>> Could you clarify where we define the exact encoding ?
> I only see a URN for UTF-8 [1] while if you look at the implementations
> Java includes length in the encoding [1] while Python [1] does not.
> >> >> >> >> >>>>>>>
> >> >> >> >> >>>>>>> [1]
> https://github.com/apache/beam/blob/069fc3de95bd96f34c363308ad9ba988ab58502d/model/pipeline/src/main/proto/beam_runner_api.proto#L563
> >> >> >> >> >>>>>>> [2]
> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/coders/StringUtf8Coder.java#L50
> >> >> >> >> >>>>>>> [3]
> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/coders/coders.py#L321
> >> >> >> >> >>>>>>>
> >> >> >> >> >>>>>>>
> >> >> >> >> >>>>>>>>
> >> >> >> >> >>>>>>>>
> >> >> >> >> >>>>>>>> We should define the spec clearly and have
> cross-language tests.
> >> >> >> >> >>>>>>>
> >> >> >> >> >>>>>>>
> >> >> >> >> >>>>>>> +1
> >> >> >> >> >>>>>>>
> >> >> >> >> >>>>>>> Regarding backwards compatibility, I agree that we
> should probably not update existing coder classes. Probably we should just
> standardize the correct encoding (may be as a comment near corresponding
> URN in the beam_runner_api.proto ?) and create new coder classes as needed.
> >> >> >> >> >>>>>>>
> >> >> >> >> >>>>>>>>
> >> >> >> >> >>>>>>>>
> >> >> >> >> >>>>>>>> On Thu, Apr 4, 2019 at 8:13 PM Pablo Estrada <
> pabloem@google.com> wrote:
> >> >> >> >> >>>>>>>> >
> >> >> >> >> >>>>>>>> > Could this be a backwards-incompatible change that
> would break pipelines from upgrading? If they have data in-flight in
> between operators, and we change the coder, they would break?
> >> >> >> >> >>>>>>>> > I know very little about coders, but since nobody
> has mentioned it, I wanted to make sure we have it in mind.
> >> >> >> >> >>>>>>>> > -P.
> >> >> >> >> >>>>>>>> >
> >> >> >> >> >>>>>>>> > On Wed, Apr 3, 2019 at 8:33 PM Kenneth Knowles <
> kenn@apache.org> wrote:
> >> >> >> >> >>>>>>>> >>
> >> >> >> >> >>>>>>>> >> Agree that a coder URN defines the encoding. I see
> that string UTF-8 was added to the proto enum, but it needs a written spec
> of the encoding. Ideally some test data that different languages can use to
> drive compliance testing.
> >> >> >> >> >>>>>>>> >>
> >> >> >> >> >>>>>>>> >> Kenn
> >> >> >> >> >>>>>>>> >>
> >> >> >> >> >>>>>>>> >> On Wed, Apr 3, 2019 at 6:21 PM Robert Burke <
> robert@frantil.com> wrote:
> >> >> >> >> >>>>>>>> >>>
> >> >> >> >> >>>>>>>> >>> String UTF8 was recently added as a "standard
> coder " URN in the protos, but I don't think that developed beyond Java, so
> adding it to Python would be reasonable in my opinion.
> >> >> >> >> >>>>>>>> >>>
> >> >> >> >> >>>>>>>> >>> The Go SDK handles Strings as "custom coders"
> presently which for Go are always length prefixed (and reported to the
> Runner as LP+CustomCoder). It would be straight forward to add the correct
> handling for strings, as Go natively treats strings as UTF8.
> >> >> >> >> >>>>>>>> >>>
> >> >> >> >> >>>>>>>> >>>
> >> >> >> >> >>>>>>>> >>> On Wed, Apr 3, 2019, 5:03 PM Heejong Lee <
> heejong@google.com> wrote:
> >> >> >> >> >>>>>>>> >>>>
> >> >> >> >> >>>>>>>> >>>> Hi all,
> >> >> >> >> >>>>>>>> >>>>
> >> >> >> >> >>>>>>>> >>>> It looks like UTF-8 String Coder in Java and
> Python SDKs uses different encoding schemes. StringUtf8Coder in Java SDK
> puts the varint length of the input string before actual data bytes however
> StrUtf8Coder in Python SDK directly encodes the input string to bytes
> value. For the last few weeks, I've been testing and fixing cross-language
> IO transforms and this discrepancy is a major blocker for me. IMO, we
> should unify the encoding schemes of UTF8 strings across the different SDKs
> and make it a standard coder. Any thoughts?
> >> >> >> >> >>>>>>>> >>>>
> >> >> >> >> >>>>>>>> >>>> Thanks,
>

Re: [DISCUSS] change the encoding scheme of Python StrUtf8Coder

Posted by Thomas Weise <th...@apache.org>.

I opened https://github.com/apache/beam/pull/8319 to eliminate the
duplicate yaml file (and cover timestamp coder for the Python SDK). Would
appreciate if someone could take a look. (PR doesn't affect the
StrUtf8Coder subject, but it is required to fix a timer bug.)

Thanks,
Thomas


On Fri, Apr 12, 2019 at 2:17 PM Lukasz Cwik <lc...@google.com> wrote:

> This is a minor point Robert Burke but having access to the "stream" when
> decoding/encoding could mean that your reading/writing from the underlying
> transport channel directly and not needing to copy the bytes into/from
> memory.
>
>
>
> On Wed, Apr 10, 2019 at 3:45 PM Kenneth Knowles <ke...@apache.org> wrote:
>
>> On Mon, Apr 8, 2019 at 4:03 PM Robert Bradshaw <ro...@google.com>
>> wrote:
>>
>>> This email is already very long, but in summary I think the right
>>> answer is to just get rid of Outer altogether (except possibly for
>>> IOs, which we'd only preserve for legacy reasons until 3.0).
>>>
>>> - Robert
>>>
>>
>> I had forgotten that compatibility from legacy to portable pipelines is
>> not a concern. So, can this be made into a plan? Would it start from this
>> point:
>>
>>  - Java classes have to exist as-is from a user's point of view, for
>> compatibility
>>  - Portable pipelines should include only self-delimiting encodings (tiny
>> primitives do not require length prefix, large iterables get special
>> treatment)
>>  - When creating a Coder from a portable proto, instantiate one that is
>> fixed to the Inner context
>>
>> I would skip having known relationship between a coder and a
>> self-delimiting variant.
>>
>> Kenn
>>
>>
>>> >> >> > On Fri, Apr 5, 2019 at 12:38 AM Robert Bradshaw <
>>> robertwb@google.com> wrote:
>>> >> >> >>
>>> >> >> >> On Fri, Apr 5, 2019 at 12:50 AM Heejong Lee <he...@google.com>
>>> wrote:
>>> >> >> >> >
>>> >> >> >> > Robert, does nested/unnested context work properly for Java?
>>> >> >> >>
>>> >> >> >> I believe so. It is similar to the bytes coder, that prefixes
>>> vs. not
>>> >> >> >> based on the context.
>>> >> >> >>
>>> >> >> >> > I can see that the Context is fixed to NESTED[1] and the
>>> encode method with the Context parameter is marked as deprecated[2].
>>> >> >> >> >
>>> >> >> >> > [1]:
>>> https://github.com/apache/beam/blob/0868e7544fd1e96db67ff5b9e70a67802c0f0c8e/sdks/java/core/src/main/java/org/apache/beam/sdk/coders/StringUtf8Coder.java#L68
>>> >> >> >> > [2]:
>>> https://github.com/apache/beam/blob/0868e7544fd1e96db67ff5b9e70a67802c0f0c8e/sdks/java/core/src/main/java/org/apache/beam/sdk/coders/Coder.java#L132
>>> >> >> >>
>>> >> >> >> That doesn't mean it's unused, e.g.
>>> >> >> >>
>>> >> >> >>
>>> https://github.com/apache/beam/blob/release-2.12.0/sdks/java/core/src/main/java/org/apache/beam/sdk/util/CoderUtils.java#L160
>>> >> >> >>
>>> https://github.com/apache/beam/blob/release-2.12.0/sdks/java/core/src/main/java/org/apache/beam/sdk/coders/LengthPrefixCoder.java#L64
>>> >> >> >>
>>> >> >> >> (and I'm sure there's others).
>>> >> >> >>
>>> >> >> >> > On Thu, Apr 4, 2019 at 3:25 PM Robert Bradshaw <
>>> robertwb@google.com> wrote:
>>> >> >> >> >>
>>> >> >> >> >> I don't know why there are two separate copies of
>>> >> >> >> >> standard_coders.yaml--originally there was just one (though
>>> it did
>>> >> >> >> >> live in the Python directory). I'm guessing a copy was made
>>> rather
>>> >> >> >> >> than just pointing both to the new location, but that
>>> completely
>>> >> >> >> >> defeats the point. I can't seem to access JIRA right now;
>>> could
>>> >> >> >> >> someone file an issue to resolve this?
>>> >> >> >> >>
>>> >> >> >> >> I also think the spec should be next to the definition of
>>> the URN,
>>> >> >> >> >> that's one of the reason the URNs were originally in a
>>> markdown file
>>> >> >> >> >> (to encourage good documentation, literate programming
>>> style). Many
>>> >> >> >> >> coders already have their specs there.
>>> >> >> >> >>
>>> >> >> >> >> Regarding backwards compatibility, we can't change existing
>>> coders,
>>> >> >> >> >> and making new coders won't help with inference ('cause
>>> changing that
>>> >> >> >> >> would also be backwards incompatible). Fortunately, I think
>>> we're
>>> >> >> >> >> already doing the consistent thing here: In both Python and
>>> Java the
>>> >> >> >> >> raw UTF-8 encoded bytes are encoded when used in an
>>> *unnested* context
>>> >> >> >> >> and the length-prefixed UTF-8 encoded bytes are used when
>>> the coder is
>>> >> >> >> >> used in a *nested* context.
>>> >> >> >> >>
>>> >> >> >> >> I'd really like to see the whole nested/unnested context go
>>> away, but
>>> >> >> >> >> that'll probably require Beam 3.0; it causes way more
>>> confusion than
>>> >> >> >> >> the couple of bytes it saves in a couple of places.
>>> >> >> >> >>
>>> >> >> >> >> - Robert
>>> >> >> >> >>
>>> >> >> >> >> On Thu, Apr 4, 2019 at 10:55 PM Robert Burke <
>>> robert@frantil.com> wrote:
>>> >> >> >> >> >
>>> >> >> >> >> > My 2cents is that the "Textual description" should be part
>>> of the documentation of the URNs on the Proto messages, since that's the
>>> common place. I've added a short description for the varints for example,
>>> and we already have lenghthier format & protocol descriptions there for
>>> iterables and similar.
>>> >> >> >> >> >
>>> >> >> >> >> > The proto [1] *can be* the spec if we want it to be.
>>> >> >> >> >> >
>>> >> >> >> >> > [1]:
>>> https://github.com/apache/beam/blob/069fc3de95bd96f34c363308ad9ba988ab58502d/model/pipeline/src/main/proto/beam_runner_api.proto#L557
>>> >> >> >> >> >
>>> >> >> >> >> > On Thu, 4 Apr 2019 at 13:51, Kenneth Knowles <
>>> kenn@apache.org> wrote:
>>> >> >> >> >> >>
>>> >> >> >> >> >>
>>> >> >> >> >> >>
>>> >> >> >> >> >> On Thu, Apr 4, 2019 at 1:49 PM Robert Burke <
>>> robert@frantil.com> wrote:
>>> >> >> >> >> >>>
>>> >> >> >> >> >>> We should probably move the "java" version of the yaml
>>> file [1] to a common location rather than deep in the java hierarchy, or
>>> copying it for Go and Python, but that can be a separate task. It's
>>> probably non-trivial since it looks like it's part of a java resources
>>> structure.
>>> >> >> >> >> >>
>>> >> >> >> >> >>
>>> >> >> >> >> >> Seems like /model is a good place for this if we don't
>>> want to invent a new language-independent hierarchy.
>>> >> >> >> >> >>
>>> >> >> >> >> >> Kenn
>>> >> >> >> >> >>
>>> >> >> >> >> >>
>>> >> >> >> >> >>>
>>> >> >> >> >> >>> Luke, the Go SDK doesn't currently do this validation,
>>> but it shouldn't be difficult, given pointers to the Java and Python
>>> variants of the tests to crib from [2]. Care would need to be taken so that
>>> Beam Go SDK users (such as they are) aren't forced to run them, and not
>>> have the yaml file to read. I'd suggest putting it with the integration
>>> tests [3].
>>> >> >> >> >> >>>
>>> >> >> >> >> >>> I've filed a JIRA (BEAM-7009) for tracking this Go SDK
>>> side work. [4]
>>> >> >> >> >> >>>
>>> >> >> >> >> >>> 1:
>>> https://github.com/apache/beam/blob/master/model/fn-execution/src/main/resources/org/apache/beam/model/fnexecution/v1/standard_coders.yaml
>>> >> >> >> >> >>> 2:
>>> https://github.com/apache/beam/search?q=standard_coders.yaml&unscoped_q=standard_coders.yaml
>>> >> >> >> >> >>> 3:
>>> https://github.com/apache/beam/tree/master/sdks/go/test
>>> >> >> >> >> >>> 4: https://issues.apache.org/jira/browse/BEAM-7009
>>> >> >> >> >> >>>
>>> >> >> >> >> >>>
>>> >> >> >> >> >>> On Thu, 4 Apr 2019 at 13:28, Lukasz Cwik <
>>> lcwik@google.com> wrote:
>>> >> >> >> >> >>>>
>>> >> >> >> >> >>>>
>>> >> >> >> >> >>>>
>>> >> >> >> >> >>>> On Thu, Apr 4, 2019 at 1:15 PM Chamikara Jayalath <
>>> chamikara@google.com> wrote:
>>> >> >> >> >> >>>>>
>>> >> >> >> >> >>>>>
>>> >> >> >> >> >>>>>
>>> >> >> >> >> >>>>> On Thu, Apr 4, 2019 at 12:15 PM Lukasz Cwik <
>>> lcwik@google.com> wrote:
>>> >> >> >> >> >>>>>>
>>> >> >> >> >> >>>>>> standard_coders.yaml[1] is where we are currently
>>> defining these formats.
>>> >> >> >> >> >>>>>> Unfortunately the Python SDK has its own copy[2].
>>> >> >> >> >> >>>>>
>>> >> >> >> >> >>>>>
>>> >> >> >> >> >>>>> Ah great. Thanks for the pointer. Any idea why
>>> there's  a separate copy for Python ? I didn't see a significant difference
>>> in definitions looking at few random coders there but I might have missed
>>> something. If there's no reason to maintain two, we should probably unify.
>>> >> >> >> >> >>>>> Also, seems like we haven't added the definition for
>>> UTF-8 coder yet.
>>> >> >> >> >> >>>>>
>>> >> >> >> >> >>>>
>>> >> >> >> >> >>>> Not certain as well. I did notice the timer coder
>>> definition didn't exist in the Python copy.
>>> >> >> >> >> >>>>
>>> >> >> >> >> >>>>>>
>>> >> >> >> >> >>>>>>
>>> >> >> >> >> >>>>>> Here is an example PR[3] that adds the
>>> "beam:coder:double:v1" as tests to the Java and Python SDKs to ensure
>>> interoperability.
>>> >> >> >> >> >>>>>>
>>> >> >> >> >> >>>>>> Robert Burke, does the Go SDK have a test where it
>>> uses standard_coders.yaml and runs compatibility tests?
>>> >> >> >> >> >>>>>>
>>> >> >> >> >> >>>>>> Chamikara, creating new coder classes is a pain since
>>> the type -> coder mapping per SDK language would select the non-well known
>>> type if we added a new one to a language. If we swapped the default
>>> type->coder mapping, this would still break update for pipelines forcing
>>> users to update their code to select the non-well known type. If we don't
>>> change the default type->coder mapping, the well known coder will gain
>>> little usage. I think we should fix the Python coder to use the same
>>> encoding as Java for UTF-8 strings before there are too many Python SDK
>>> users.
>>> >> >> >> >> >>>>>
>>> >> >> >> >> >>>>>
>>> >> >> >> >> >>>>> I was thinking that may be we should just change the
>>> default UTF-8 coder for Fn API path which is experimental. Updating Python
>>> to do what's done for Java is fine if we agree that encoding used for Java
>>> should be the standard.
>>> >> >> >> >> >>>>>
>>> >> >> >> >> >>>>
>>> >> >> >> >> >>>> That is a good idea to use the Fn API experiment to
>>> control which gets selected.
>>> >> >> >> >> >>>>
>>> >> >> >> >> >>>>>>
>>> >> >> >> >> >>>>>>
>>> >> >> >> >> >>>>>> 1:
>>> https://github.com/apache/beam/blob/master/model/fn-execution/src/main/resources/org/apache/beam/model/fnexecution/v1/standard_coders.yaml
>>> >> >> >> >> >>>>>> 2:
>>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/testing/data/standard_coders.yaml
>>> >> >> >> >> >>>>>> 3: https://github.com/apache/beam/pull/8205
>>> >> >> >> >> >>>>>>
>>> >> >> >> >> >>>>>> On Thu, Apr 4, 2019 at 11:50 AM Chamikara Jayalath <
>>> chamikara@google.com> wrote:
>>> >> >> >> >> >>>>>>>
>>> >> >> >> >> >>>>>>>
>>> >> >> >> >> >>>>>>>
>>> >> >> >> >> >>>>>>> On Thu, Apr 4, 2019 at 11:29 AM Robert Bradshaw <
>>> robertwb@google.com> wrote:
>>> >> >> >> >> >>>>>>>>
>>> >> >> >> >> >>>>>>>> A URN defines the encoding.
>>> >> >> >> >> >>>>>>>>
>>> >> >> >> >> >>>>>>>> There are (unfortunately) *two* encodings defined
>>> for a Coder (defined
>>> >> >> >> >> >>>>>>>> by a URN), the nested and the unnested one. IIRC,
>>> in both Java and
>>> >> >> >> >> >>>>>>>> Python, the nested one prefixes with a var-int
>>> length, and the
>>> >> >> >> >> >>>>>>>> unnested one does not.
>>> >> >> >> >> >>>>>>>
>>> >> >> >> >> >>>>>>>
>>> >> >> >> >> >>>>>>> Could you clarify where we define the exact encoding
>>> ? I only see a URN for UTF-8 [1] while if you look at the implementations
>>> Java includes length in the encoding [1] while Python [1] does not.
>>> >> >> >> >> >>>>>>>
>>> >> >> >> >> >>>>>>> [1]
>>> https://github.com/apache/beam/blob/069fc3de95bd96f34c363308ad9ba988ab58502d/model/pipeline/src/main/proto/beam_runner_api.proto#L563
>>> >> >> >> >> >>>>>>> [2]
>>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/coders/StringUtf8Coder.java#L50
>>> >> >> >> >> >>>>>>> [3]
>>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/coders/coders.py#L321
>>> >> >> >> >> >>>>>>>
>>> >> >> >> >> >>>>>>>
>>> >> >> >> >> >>>>>>>>
>>> >> >> >> >> >>>>>>>>
>>> >> >> >> >> >>>>>>>> We should define the spec clearly and have
>>> cross-language tests.
>>> >> >> >> >> >>>>>>>
>>> >> >> >> >> >>>>>>>
>>> >> >> >> >> >>>>>>> +1
>>> >> >> >> >> >>>>>>>
>>> >> >> >> >> >>>>>>> Regarding backwards compatibility, I agree that we
>>> should probably not update existing coder classes. Probably we should just
>>> standardize the correct encoding (may be as a comment near corresponding
>>> URN in the beam_runner_api.proto ?) and create new coder classes as needed.
>>> >> >> >> >> >>>>>>>
>>> >> >> >> >> >>>>>>>>
>>> >> >> >> >> >>>>>>>>
>>> >> >> >> >> >>>>>>>> On Thu, Apr 4, 2019 at 8:13 PM Pablo Estrada <
>>> pabloem@google.com> wrote:
>>> >> >> >> >> >>>>>>>> >
>>> >> >> >> >> >>>>>>>> > Could this be a backwards-incompatible change
>>> that would break pipelines from upgrading? If they have data in-flight in
>>> between operators, and we change the coder, they would break?
>>> >> >> >> >> >>>>>>>> > I know very little about coders, but since nobody
>>> has mentioned it, I wanted to make sure we have it in mind.
>>> >> >> >> >> >>>>>>>> > -P.
>>> >> >> >> >> >>>>>>>> >
>>> >> >> >> >> >>>>>>>> > On Wed, Apr 3, 2019 at 8:33 PM Kenneth Knowles <
>>> kenn@apache.org> wrote:
>>> >> >> >> >> >>>>>>>> >>
>>> >> >> >> >> >>>>>>>> >> Agree that a coder URN defines the encoding. I
>>> see that string UTF-8 was added to the proto enum, but it needs a written
>>> spec of the encoding. Ideally some test data that different languages can
>>> use to drive compliance testing.
>>> >> >> >> >> >>>>>>>> >>
>>> >> >> >> >> >>>>>>>> >> Kenn
>>> >> >> >> >> >>>>>>>> >>
>>> >> >> >> >> >>>>>>>> >> On Wed, Apr 3, 2019 at 6:21 PM Robert Burke <
>>> robert@frantil.com> wrote:
>>> >> >> >> >> >>>>>>>> >>>
>>> >> >> >> >> >>>>>>>> >>> String UTF8 was recently added as a "standard
>>> coder " URN in the protos, but I don't think that developed beyond Java, so
>>> adding it to Python would be reasonable in my opinion.
>>> >> >> >> >> >>>>>>>> >>>
>>> >> >> >> >> >>>>>>>> >>> The Go SDK handles Strings as "custom coders"
>>> presently which for Go are always length prefixed (and reported to the
>>> Runner as LP+CustomCoder). It would be straight forward to add the correct
>>> handling for strings, as Go natively treats strings as UTF8.
>>> >> >> >> >> >>>>>>>> >>>
>>> >> >> >> >> >>>>>>>> >>>
>>> >> >> >> >> >>>>>>>> >>> On Wed, Apr 3, 2019, 5:03 PM Heejong Lee <
>>> heejong@google.com> wrote:
>>> >> >> >> >> >>>>>>>> >>>>
>>> >> >> >> >> >>>>>>>> >>>> Hi all,
>>> >> >> >> >> >>>>>>>> >>>>
>>> >> >> >> >> >>>>>>>> >>>> It looks like UTF-8 String Coder in Java and
>>> Python SDKs uses different encoding schemes. StringUtf8Coder in Java SDK
>>> puts the varint length of the input string before actual data bytes however
>>> StrUtf8Coder in Python SDK directly encodes the input string to bytes
>>> value. For the last few weeks, I've been testing and fixing cross-language
>>> IO transforms and this discrepancy is a major blocker for me. IMO, we
>>> should unify the encoding schemes of UTF8 strings across the different SDKs
>>> and make it a standard coder. Any thoughts?
>>> >> >> >> >> >>>>>>>> >>>>
>>> >> >> >> >> >>>>>>>> >>>> Thanks,
>>>
>>

Re: [DISCUSS] change the encoding scheme of Python StrUtf8Coder

Posted by Lukasz Cwik <lc...@google.com>.

This is a minor point Robert Burke but having access to the "stream" when
decoding/encoding could mean that your reading/writing from the underlying
transport channel directly and not needing to copy the bytes into/from
memory.



On Wed, Apr 10, 2019 at 3:45 PM Kenneth Knowles <ke...@apache.org> wrote:

> On Mon, Apr 8, 2019 at 4:03 PM Robert Bradshaw <ro...@google.com>
> wrote:
>
>> This email is already very long, but in summary I think the right
>> answer is to just get rid of Outer altogether (except possibly for
>> IOs, which we'd only preserve for legacy reasons until 3.0).
>>
>> - Robert
>>
>
> I had forgotten that compatibility from legacy to portable pipelines is
> not a concern. So, can this be made into a plan? Would it start from this
> point:
>
>  - Java classes have to exist as-is from a user's point of view, for
> compatibility
>  - Portable pipelines should include only self-delimiting encodings (tiny
> primitives do not require length prefix, large iterables get special
> treatment)
>  - When creating a Coder from a portable proto, instantiate one that is
> fixed to the Inner context
>
> I would skip having known relationship between a coder and a
> self-delimiting variant.
>
> Kenn
>
>
>> >> >> > On Fri, Apr 5, 2019 at 12:38 AM Robert Bradshaw <
>> robertwb@google.com> wrote:
>> >> >> >>
>> >> >> >> On Fri, Apr 5, 2019 at 12:50 AM Heejong Lee <he...@google.com>
>> wrote:
>> >> >> >> >
>> >> >> >> > Robert, does nested/unnested context work properly for Java?
>> >> >> >>
>> >> >> >> I believe so. It is similar to the bytes coder, that prefixes
>> vs. not
>> >> >> >> based on the context.
>> >> >> >>
>> >> >> >> > I can see that the Context is fixed to NESTED[1] and the
>> encode method with the Context parameter is marked as deprecated[2].
>> >> >> >> >
>> >> >> >> > [1]:
>> https://github.com/apache/beam/blob/0868e7544fd1e96db67ff5b9e70a67802c0f0c8e/sdks/java/core/src/main/java/org/apache/beam/sdk/coders/StringUtf8Coder.java#L68
>> >> >> >> > [2]:
>> https://github.com/apache/beam/blob/0868e7544fd1e96db67ff5b9e70a67802c0f0c8e/sdks/java/core/src/main/java/org/apache/beam/sdk/coders/Coder.java#L132
>> >> >> >>
>> >> >> >> That doesn't mean it's unused, e.g.
>> >> >> >>
>> >> >> >>
>> https://github.com/apache/beam/blob/release-2.12.0/sdks/java/core/src/main/java/org/apache/beam/sdk/util/CoderUtils.java#L160
>> >> >> >>
>> https://github.com/apache/beam/blob/release-2.12.0/sdks/java/core/src/main/java/org/apache/beam/sdk/coders/LengthPrefixCoder.java#L64
>> >> >> >>
>> >> >> >> (and I'm sure there's others).
>> >> >> >>
>> >> >> >> > On Thu, Apr 4, 2019 at 3:25 PM Robert Bradshaw <
>> robertwb@google.com> wrote:
>> >> >> >> >>
>> >> >> >> >> I don't know why there are two separate copies of
>> >> >> >> >> standard_coders.yaml--originally there was just one (though
>> it did
>> >> >> >> >> live in the Python directory). I'm guessing a copy was made
>> rather
>> >> >> >> >> than just pointing both to the new location, but that
>> completely
>> >> >> >> >> defeats the point. I can't seem to access JIRA right now;
>> could
>> >> >> >> >> someone file an issue to resolve this?
>> >> >> >> >>
>> >> >> >> >> I also think the spec should be next to the definition of the
>> URN,
>> >> >> >> >> that's one of the reason the URNs were originally in a
>> markdown file
>> >> >> >> >> (to encourage good documentation, literate programming
>> style). Many
>> >> >> >> >> coders already have their specs there.
>> >> >> >> >>
>> >> >> >> >> Regarding backwards compatibility, we can't change existing
>> coders,
>> >> >> >> >> and making new coders won't help with inference ('cause
>> changing that
>> >> >> >> >> would also be backwards incompatible). Fortunately, I think
>> we're
>> >> >> >> >> already doing the consistent thing here: In both Python and
>> Java the
>> >> >> >> >> raw UTF-8 encoded bytes are encoded when used in an
>> *unnested* context
>> >> >> >> >> and the length-prefixed UTF-8 encoded bytes are used when the
>> coder is
>> >> >> >> >> used in a *nested* context.
>> >> >> >> >>
>> >> >> >> >> I'd really like to see the whole nested/unnested context go
>> away, but
>> >> >> >> >> that'll probably require Beam 3.0; it causes way more
>> confusion than
>> >> >> >> >> the couple of bytes it saves in a couple of places.
>> >> >> >> >>
>> >> >> >> >> - Robert
>> >> >> >> >>
>> >> >> >> >> On Thu, Apr 4, 2019 at 10:55 PM Robert Burke <
>> robert@frantil.com> wrote:
>> >> >> >> >> >
>> >> >> >> >> > My 2cents is that the "Textual description" should be part
>> of the documentation of the URNs on the Proto messages, since that's the
>> common place. I've added a short description for the varints for example,
>> and we already have lenghthier format & protocol descriptions there for
>> iterables and similar.
>> >> >> >> >> >
>> >> >> >> >> > The proto [1] *can be* the spec if we want it to be.
>> >> >> >> >> >
>> >> >> >> >> > [1]:
>> https://github.com/apache/beam/blob/069fc3de95bd96f34c363308ad9ba988ab58502d/model/pipeline/src/main/proto/beam_runner_api.proto#L557
>> >> >> >> >> >
>> >> >> >> >> > On Thu, 4 Apr 2019 at 13:51, Kenneth Knowles <
>> kenn@apache.org> wrote:
>> >> >> >> >> >>
>> >> >> >> >> >>
>> >> >> >> >> >>
>> >> >> >> >> >> On Thu, Apr 4, 2019 at 1:49 PM Robert Burke <
>> robert@frantil.com> wrote:
>> >> >> >> >> >>>
>> >> >> >> >> >>> We should probably move the "java" version of the yaml
>> file [1] to a common location rather than deep in the java hierarchy, or
>> copying it for Go and Python, but that can be a separate task. It's
>> probably non-trivial since it looks like it's part of a java resources
>> structure.
>> >> >> >> >> >>
>> >> >> >> >> >>
>> >> >> >> >> >> Seems like /model is a good place for this if we don't
>> want to invent a new language-independent hierarchy.
>> >> >> >> >> >>
>> >> >> >> >> >> Kenn
>> >> >> >> >> >>
>> >> >> >> >> >>
>> >> >> >> >> >>>
>> >> >> >> >> >>> Luke, the Go SDK doesn't currently do this validation,
>> but it shouldn't be difficult, given pointers to the Java and Python
>> variants of the tests to crib from [2]. Care would need to be taken so that
>> Beam Go SDK users (such as they are) aren't forced to run them, and not
>> have the yaml file to read. I'd suggest putting it with the integration
>> tests [3].
>> >> >> >> >> >>>
>> >> >> >> >> >>> I've filed a JIRA (BEAM-7009) for tracking this Go SDK
>> side work. [4]
>> >> >> >> >> >>>
>> >> >> >> >> >>> 1:
>> https://github.com/apache/beam/blob/master/model/fn-execution/src/main/resources/org/apache/beam/model/fnexecution/v1/standard_coders.yaml
>> >> >> >> >> >>> 2:
>> https://github.com/apache/beam/search?q=standard_coders.yaml&unscoped_q=standard_coders.yaml
>> >> >> >> >> >>> 3:
>> https://github.com/apache/beam/tree/master/sdks/go/test
>> >> >> >> >> >>> 4: https://issues.apache.org/jira/browse/BEAM-7009
>> >> >> >> >> >>>
>> >> >> >> >> >>>
>> >> >> >> >> >>> On Thu, 4 Apr 2019 at 13:28, Lukasz Cwik <
>> lcwik@google.com> wrote:
>> >> >> >> >> >>>>
>> >> >> >> >> >>>>
>> >> >> >> >> >>>>
>> >> >> >> >> >>>> On Thu, Apr 4, 2019 at 1:15 PM Chamikara Jayalath <
>> chamikara@google.com> wrote:
>> >> >> >> >> >>>>>
>> >> >> >> >> >>>>>
>> >> >> >> >> >>>>>
>> >> >> >> >> >>>>> On Thu, Apr 4, 2019 at 12:15 PM Lukasz Cwik <
>> lcwik@google.com> wrote:
>> >> >> >> >> >>>>>>
>> >> >> >> >> >>>>>> standard_coders.yaml[1] is where we are currently
>> defining these formats.
>> >> >> >> >> >>>>>> Unfortunately the Python SDK has its own copy[2].
>> >> >> >> >> >>>>>
>> >> >> >> >> >>>>>
>> >> >> >> >> >>>>> Ah great. Thanks for the pointer. Any idea why there's
>> a separate copy for Python ? I didn't see a significant difference in
>> definitions looking at few random coders there but I might have missed
>> something. If there's no reason to maintain two, we should probably unify.
>> >> >> >> >> >>>>> Also, seems like we haven't added the definition for
>> UTF-8 coder yet.
>> >> >> >> >> >>>>>
>> >> >> >> >> >>>>
>> >> >> >> >> >>>> Not certain as well. I did notice the timer coder
>> definition didn't exist in the Python copy.
>> >> >> >> >> >>>>
>> >> >> >> >> >>>>>>
>> >> >> >> >> >>>>>>
>> >> >> >> >> >>>>>> Here is an example PR[3] that adds the
>> "beam:coder:double:v1" as tests to the Java and Python SDKs to ensure
>> interoperability.
>> >> >> >> >> >>>>>>
>> >> >> >> >> >>>>>> Robert Burke, does the Go SDK have a test where it
>> uses standard_coders.yaml and runs compatibility tests?
>> >> >> >> >> >>>>>>
>> >> >> >> >> >>>>>> Chamikara, creating new coder classes is a pain since
>> the type -> coder mapping per SDK language would select the non-well known
>> type if we added a new one to a language. If we swapped the default
>> type->coder mapping, this would still break update for pipelines forcing
>> users to update their code to select the non-well known type. If we don't
>> change the default type->coder mapping, the well known coder will gain
>> little usage. I think we should fix the Python coder to use the same
>> encoding as Java for UTF-8 strings before there are too many Python SDK
>> users.
>> >> >> >> >> >>>>>
>> >> >> >> >> >>>>>
>> >> >> >> >> >>>>> I was thinking that may be we should just change the
>> default UTF-8 coder for Fn API path which is experimental. Updating Python
>> to do what's done for Java is fine if we agree that encoding used for Java
>> should be the standard.
>> >> >> >> >> >>>>>
>> >> >> >> >> >>>>
>> >> >> >> >> >>>> That is a good idea to use the Fn API experiment to
>> control which gets selected.
>> >> >> >> >> >>>>
>> >> >> >> >> >>>>>>
>> >> >> >> >> >>>>>>
>> >> >> >> >> >>>>>> 1:
>> https://github.com/apache/beam/blob/master/model/fn-execution/src/main/resources/org/apache/beam/model/fnexecution/v1/standard_coders.yaml
>> >> >> >> >> >>>>>> 2:
>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/testing/data/standard_coders.yaml
>> >> >> >> >> >>>>>> 3: https://github.com/apache/beam/pull/8205
>> >> >> >> >> >>>>>>
>> >> >> >> >> >>>>>> On Thu, Apr 4, 2019 at 11:50 AM Chamikara Jayalath <
>> chamikara@google.com> wrote:
>> >> >> >> >> >>>>>>>
>> >> >> >> >> >>>>>>>
>> >> >> >> >> >>>>>>>
>> >> >> >> >> >>>>>>> On Thu, Apr 4, 2019 at 11:29 AM Robert Bradshaw <
>> robertwb@google.com> wrote:
>> >> >> >> >> >>>>>>>>
>> >> >> >> >> >>>>>>>> A URN defines the encoding.
>> >> >> >> >> >>>>>>>>
>> >> >> >> >> >>>>>>>> There are (unfortunately) *two* encodings defined
>> for a Coder (defined
>> >> >> >> >> >>>>>>>> by a URN), the nested and the unnested one. IIRC, in
>> both Java and
>> >> >> >> >> >>>>>>>> Python, the nested one prefixes with a var-int
>> length, and the
>> >> >> >> >> >>>>>>>> unnested one does not.
>> >> >> >> >> >>>>>>>
>> >> >> >> >> >>>>>>>
>> >> >> >> >> >>>>>>> Could you clarify where we define the exact encoding
>> ? I only see a URN for UTF-8 [1] while if you look at the implementations
>> Java includes length in the encoding [1] while Python [1] does not.
>> >> >> >> >> >>>>>>>
>> >> >> >> >> >>>>>>> [1]
>> https://github.com/apache/beam/blob/069fc3de95bd96f34c363308ad9ba988ab58502d/model/pipeline/src/main/proto/beam_runner_api.proto#L563
>> >> >> >> >> >>>>>>> [2]
>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/coders/StringUtf8Coder.java#L50
>> >> >> >> >> >>>>>>> [3]
>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/coders/coders.py#L321
>> >> >> >> >> >>>>>>>
>> >> >> >> >> >>>>>>>
>> >> >> >> >> >>>>>>>>
>> >> >> >> >> >>>>>>>>
>> >> >> >> >> >>>>>>>> We should define the spec clearly and have
>> cross-language tests.
>> >> >> >> >> >>>>>>>
>> >> >> >> >> >>>>>>>
>> >> >> >> >> >>>>>>> +1
>> >> >> >> >> >>>>>>>
>> >> >> >> >> >>>>>>> Regarding backwards compatibility, I agree that we
>> should probably not update existing coder classes. Probably we should just
>> standardize the correct encoding (may be as a comment near corresponding
>> URN in the beam_runner_api.proto ?) and create new coder classes as needed.
>> >> >> >> >> >>>>>>>
>> >> >> >> >> >>>>>>>>
>> >> >> >> >> >>>>>>>>
>> >> >> >> >> >>>>>>>> On Thu, Apr 4, 2019 at 8:13 PM Pablo Estrada <
>> pabloem@google.com> wrote:
>> >> >> >> >> >>>>>>>> >
>> >> >> >> >> >>>>>>>> > Could this be a backwards-incompatible change that
>> would break pipelines from upgrading? If they have data in-flight in
>> between operators, and we change the coder, they would break?
>> >> >> >> >> >>>>>>>> > I know very little about coders, but since nobody
>> has mentioned it, I wanted to make sure we have it in mind.
>> >> >> >> >> >>>>>>>> > -P.
>> >> >> >> >> >>>>>>>> >
>> >> >> >> >> >>>>>>>> > On Wed, Apr 3, 2019 at 8:33 PM Kenneth Knowles <
>> kenn@apache.org> wrote:
>> >> >> >> >> >>>>>>>> >>
>> >> >> >> >> >>>>>>>> >> Agree that a coder URN defines the encoding. I
>> see that string UTF-8 was added to the proto enum, but it needs a written
>> spec of the encoding. Ideally some test data that different languages can
>> use to drive compliance testing.
>> >> >> >> >> >>>>>>>> >>
>> >> >> >> >> >>>>>>>> >> Kenn
>> >> >> >> >> >>>>>>>> >>
>> >> >> >> >> >>>>>>>> >> On Wed, Apr 3, 2019 at 6:21 PM Robert Burke <
>> robert@frantil.com> wrote:
>> >> >> >> >> >>>>>>>> >>>
>> >> >> >> >> >>>>>>>> >>> String UTF8 was recently added as a "standard
>> coder " URN in the protos, but I don't think that developed beyond Java, so
>> adding it to Python would be reasonable in my opinion.
>> >> >> >> >> >>>>>>>> >>>
>> >> >> >> >> >>>>>>>> >>> The Go SDK handles Strings as "custom coders"
>> presently which for Go are always length prefixed (and reported to the
>> Runner as LP+CustomCoder). It would be straight forward to add the correct
>> handling for strings, as Go natively treats strings as UTF8.
>> >> >> >> >> >>>>>>>> >>>
>> >> >> >> >> >>>>>>>> >>>
>> >> >> >> >> >>>>>>>> >>> On Wed, Apr 3, 2019, 5:03 PM Heejong Lee <
>> heejong@google.com> wrote:
>> >> >> >> >> >>>>>>>> >>>>
>> >> >> >> >> >>>>>>>> >>>> Hi all,
>> >> >> >> >> >>>>>>>> >>>>
>> >> >> >> >> >>>>>>>> >>>> It looks like UTF-8 String Coder in Java and
>> Python SDKs uses different encoding schemes. StringUtf8Coder in Java SDK
>> puts the varint length of the input string before actual data bytes however
>> StrUtf8Coder in Python SDK directly encodes the input string to bytes
>> value. For the last few weeks, I've been testing and fixing cross-language
>> IO transforms and this discrepancy is a major blocker for me. IMO, we
>> should unify the encoding schemes of UTF8 strings across the different SDKs
>> and make it a standard coder. Any thoughts?
>> >> >> >> >> >>>>>>>> >>>>
>> >> >> >> >> >>>>>>>> >>>> Thanks,
>>
>

Re: [DISCUSS] change the encoding scheme of Python StrUtf8Coder

Posted by Kenneth Knowles <ke...@apache.org>.

On Mon, Apr 8, 2019 at 4:03 PM Robert Bradshaw <ro...@google.com> wrote:

> This email is already very long, but in summary I think the right
> answer is to just get rid of Outer altogether (except possibly for
> IOs, which we'd only preserve for legacy reasons until 3.0).
>
> - Robert
>

I had forgotten that compatibility from legacy to portable pipelines is not
a concern. So, can this be made into a plan? Would it start from this point:

 - Java classes have to exist as-is from a user's point of view, for
compatibility
 - Portable pipelines should include only self-delimiting encodings (tiny
primitives do not require length prefix, large iterables get special
treatment)
 - When creating a Coder from a portable proto, instantiate one that is
fixed to the Inner context

I would skip having known relationship between a coder and a
self-delimiting variant.

Kenn


> >> >> > On Fri, Apr 5, 2019 at 12:38 AM Robert Bradshaw <
> robertwb@google.com> wrote:
> >> >> >>
> >> >> >> On Fri, Apr 5, 2019 at 12:50 AM Heejong Lee <he...@google.com>
> wrote:
> >> >> >> >
> >> >> >> > Robert, does nested/unnested context work properly for Java?
> >> >> >>
> >> >> >> I believe so. It is similar to the bytes coder, that prefixes vs.
> not
> >> >> >> based on the context.
> >> >> >>
> >> >> >> > I can see that the Context is fixed to NESTED[1] and the encode
> method with the Context parameter is marked as deprecated[2].
> >> >> >> >
> >> >> >> > [1]:
> https://github.com/apache/beam/blob/0868e7544fd1e96db67ff5b9e70a67802c0f0c8e/sdks/java/core/src/main/java/org/apache/beam/sdk/coders/StringUtf8Coder.java#L68
> >> >> >> > [2]:
> https://github.com/apache/beam/blob/0868e7544fd1e96db67ff5b9e70a67802c0f0c8e/sdks/java/core/src/main/java/org/apache/beam/sdk/coders/Coder.java#L132
> >> >> >>
> >> >> >> That doesn't mean it's unused, e.g.
> >> >> >>
> >> >> >>
> https://github.com/apache/beam/blob/release-2.12.0/sdks/java/core/src/main/java/org/apache/beam/sdk/util/CoderUtils.java#L160
> >> >> >>
> https://github.com/apache/beam/blob/release-2.12.0/sdks/java/core/src/main/java/org/apache/beam/sdk/coders/LengthPrefixCoder.java#L64
> >> >> >>
> >> >> >> (and I'm sure there's others).
> >> >> >>
> >> >> >> > On Thu, Apr 4, 2019 at 3:25 PM Robert Bradshaw <
> robertwb@google.com> wrote:
> >> >> >> >>
> >> >> >> >> I don't know why there are two separate copies of
> >> >> >> >> standard_coders.yaml--originally there was just one (though it
> did
> >> >> >> >> live in the Python directory). I'm guessing a copy was made
> rather
> >> >> >> >> than just pointing both to the new location, but that
> completely
> >> >> >> >> defeats the point. I can't seem to access JIRA right now; could
> >> >> >> >> someone file an issue to resolve this?
> >> >> >> >>
> >> >> >> >> I also think the spec should be next to the definition of the
> URN,
> >> >> >> >> that's one of the reason the URNs were originally in a
> markdown file
> >> >> >> >> (to encourage good documentation, literate programming style).
> Many
> >> >> >> >> coders already have their specs there.
> >> >> >> >>
> >> >> >> >> Regarding backwards compatibility, we can't change existing
> coders,
> >> >> >> >> and making new coders won't help with inference ('cause
> changing that
> >> >> >> >> would also be backwards incompatible). Fortunately, I think
> we're
> >> >> >> >> already doing the consistent thing here: In both Python and
> Java the
> >> >> >> >> raw UTF-8 encoded bytes are encoded when used in an *unnested*
> context
> >> >> >> >> and the length-prefixed UTF-8 encoded bytes are used when the
> coder is
> >> >> >> >> used in a *nested* context.
> >> >> >> >>
> >> >> >> >> I'd really like to see the whole nested/unnested context go
> away, but
> >> >> >> >> that'll probably require Beam 3.0; it causes way more
> confusion than
> >> >> >> >> the couple of bytes it saves in a couple of places.
> >> >> >> >>
> >> >> >> >> - Robert
> >> >> >> >>
> >> >> >> >> On Thu, Apr 4, 2019 at 10:55 PM Robert Burke <
> robert@frantil.com> wrote:
> >> >> >> >> >
> >> >> >> >> > My 2cents is that the "Textual description" should be part
> of the documentation of the URNs on the Proto messages, since that's the
> common place. I've added a short description for the varints for example,
> and we already have lenghthier format & protocol descriptions there for
> iterables and similar.
> >> >> >> >> >
> >> >> >> >> > The proto [1] *can be* the spec if we want it to be.
> >> >> >> >> >
> >> >> >> >> > [1]:
> https://github.com/apache/beam/blob/069fc3de95bd96f34c363308ad9ba988ab58502d/model/pipeline/src/main/proto/beam_runner_api.proto#L557
> >> >> >> >> >
> >> >> >> >> > On Thu, 4 Apr 2019 at 13:51, Kenneth Knowles <
> kenn@apache.org> wrote:
> >> >> >> >> >>
> >> >> >> >> >>
> >> >> >> >> >>
> >> >> >> >> >> On Thu, Apr 4, 2019 at 1:49 PM Robert Burke <
> robert@frantil.com> wrote:
> >> >> >> >> >>>
> >> >> >> >> >>> We should probably move the "java" version of the yaml
> file [1] to a common location rather than deep in the java hierarchy, or
> copying it for Go and Python, but that can be a separate task. It's
> probably non-trivial since it looks like it's part of a java resources
> structure.
> >> >> >> >> >>
> >> >> >> >> >>
> >> >> >> >> >> Seems like /model is a good place for this if we don't want
> to invent a new language-independent hierarchy.
> >> >> >> >> >>
> >> >> >> >> >> Kenn
> >> >> >> >> >>
> >> >> >> >> >>
> >> >> >> >> >>>
> >> >> >> >> >>> Luke, the Go SDK doesn't currently do this validation, but
> it shouldn't be difficult, given pointers to the Java and Python variants
> of the tests to crib from [2]. Care would need to be taken so that Beam Go
> SDK users (such as they are) aren't forced to run them, and not have the
> yaml file to read. I'd suggest putting it with the integration tests [3].
> >> >> >> >> >>>
> >> >> >> >> >>> I've filed a JIRA (BEAM-7009) for tracking this Go SDK
> side work. [4]
> >> >> >> >> >>>
> >> >> >> >> >>> 1:
> https://github.com/apache/beam/blob/master/model/fn-execution/src/main/resources/org/apache/beam/model/fnexecution/v1/standard_coders.yaml
> >> >> >> >> >>> 2:
> https://github.com/apache/beam/search?q=standard_coders.yaml&unscoped_q=standard_coders.yaml
> >> >> >> >> >>> 3: https://github.com/apache/beam/tree/master/sdks/go/test
> >> >> >> >> >>> 4: https://issues.apache.org/jira/browse/BEAM-7009
> >> >> >> >> >>>
> >> >> >> >> >>>
> >> >> >> >> >>> On Thu, 4 Apr 2019 at 13:28, Lukasz Cwik <lc...@google.com>
> wrote:
> >> >> >> >> >>>>
> >> >> >> >> >>>>
> >> >> >> >> >>>>
> >> >> >> >> >>>> On Thu, Apr 4, 2019 at 1:15 PM Chamikara Jayalath <
> chamikara@google.com> wrote:
> >> >> >> >> >>>>>
> >> >> >> >> >>>>>
> >> >> >> >> >>>>>
> >> >> >> >> >>>>> On Thu, Apr 4, 2019 at 12:15 PM Lukasz Cwik <
> lcwik@google.com> wrote:
> >> >> >> >> >>>>>>
> >> >> >> >> >>>>>> standard_coders.yaml[1] is where we are currently
> defining these formats.
> >> >> >> >> >>>>>> Unfortunately the Python SDK has its own copy[2].
> >> >> >> >> >>>>>
> >> >> >> >> >>>>>
> >> >> >> >> >>>>> Ah great. Thanks for the pointer. Any idea why there's
> a separate copy for Python ? I didn't see a significant difference in
> definitions looking at few random coders there but I might have missed
> something. If there's no reason to maintain two, we should probably unify.
> >> >> >> >> >>>>> Also, seems like we haven't added the definition for
> UTF-8 coder yet.
> >> >> >> >> >>>>>
> >> >> >> >> >>>>
> >> >> >> >> >>>> Not certain as well. I did notice the timer coder
> definition didn't exist in the Python copy.
> >> >> >> >> >>>>
> >> >> >> >> >>>>>>
> >> >> >> >> >>>>>>
> >> >> >> >> >>>>>> Here is an example PR[3] that adds the
> "beam:coder:double:v1" as tests to the Java and Python SDKs to ensure
> interoperability.
> >> >> >> >> >>>>>>
> >> >> >> >> >>>>>> Robert Burke, does the Go SDK have a test where it uses
> standard_coders.yaml and runs compatibility tests?
> >> >> >> >> >>>>>>
> >> >> >> >> >>>>>> Chamikara, creating new coder classes is a pain since
> the type -> coder mapping per SDK language would select the non-well known
> type if we added a new one to a language. If we swapped the default
> type->coder mapping, this would still break update for pipelines forcing
> users to update their code to select the non-well known type. If we don't
> change the default type->coder mapping, the well known coder will gain
> little usage. I think we should fix the Python coder to use the same
> encoding as Java for UTF-8 strings before there are too many Python SDK
> users.
> >> >> >> >> >>>>>
> >> >> >> >> >>>>>
> >> >> >> >> >>>>> I was thinking that may be we should just change the
> default UTF-8 coder for Fn API path which is experimental. Updating Python
> to do what's done for Java is fine if we agree that encoding used for Java
> should be the standard.
> >> >> >> >> >>>>>
> >> >> >> >> >>>>
> >> >> >> >> >>>> That is a good idea to use the Fn API experiment to
> control which gets selected.
> >> >> >> >> >>>>
> >> >> >> >> >>>>>>
> >> >> >> >> >>>>>>
> >> >> >> >> >>>>>> 1:
> https://github.com/apache/beam/blob/master/model/fn-execution/src/main/resources/org/apache/beam/model/fnexecution/v1/standard_coders.yaml
> >> >> >> >> >>>>>> 2:
> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/testing/data/standard_coders.yaml
> >> >> >> >> >>>>>> 3: https://github.com/apache/beam/pull/8205
> >> >> >> >> >>>>>>
> >> >> >> >> >>>>>> On Thu, Apr 4, 2019 at 11:50 AM Chamikara Jayalath <
> chamikara@google.com> wrote:
> >> >> >> >> >>>>>>>
> >> >> >> >> >>>>>>>
> >> >> >> >> >>>>>>>
> >> >> >> >> >>>>>>> On Thu, Apr 4, 2019 at 11:29 AM Robert Bradshaw <
> robertwb@google.com> wrote:
> >> >> >> >> >>>>>>>>
> >> >> >> >> >>>>>>>> A URN defines the encoding.
> >> >> >> >> >>>>>>>>
> >> >> >> >> >>>>>>>> There are (unfortunately) *two* encodings defined for
> a Coder (defined
> >> >> >> >> >>>>>>>> by a URN), the nested and the unnested one. IIRC, in
> both Java and
> >> >> >> >> >>>>>>>> Python, the nested one prefixes with a var-int
> length, and the
> >> >> >> >> >>>>>>>> unnested one does not.
> >> >> >> >> >>>>>>>
> >> >> >> >> >>>>>>>
> >> >> >> >> >>>>>>> Could you clarify where we define the exact encoding ?
> I only see a URN for UTF-8 [1] while if you look at the implementations
> Java includes length in the encoding [1] while Python [1] does not.
> >> >> >> >> >>>>>>>
> >> >> >> >> >>>>>>> [1]
> https://github.com/apache/beam/blob/069fc3de95bd96f34c363308ad9ba988ab58502d/model/pipeline/src/main/proto/beam_runner_api.proto#L563
> >> >> >> >> >>>>>>> [2]
> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/coders/StringUtf8Coder.java#L50
> >> >> >> >> >>>>>>> [3]
> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/coders/coders.py#L321
> >> >> >> >> >>>>>>>
> >> >> >> >> >>>>>>>
> >> >> >> >> >>>>>>>>
> >> >> >> >> >>>>>>>>
> >> >> >> >> >>>>>>>> We should define the spec clearly and have
> cross-language tests.
> >> >> >> >> >>>>>>>
> >> >> >> >> >>>>>>>
> >> >> >> >> >>>>>>> +1
> >> >> >> >> >>>>>>>
> >> >> >> >> >>>>>>> Regarding backwards compatibility, I agree that we
> should probably not update existing coder classes. Probably we should just
> standardize the correct encoding (may be as a comment near corresponding
> URN in the beam_runner_api.proto ?) and create new coder classes as needed.
> >> >> >> >> >>>>>>>
> >> >> >> >> >>>>>>>>
> >> >> >> >> >>>>>>>>
> >> >> >> >> >>>>>>>> On Thu, Apr 4, 2019 at 8:13 PM Pablo Estrada <
> pabloem@google.com> wrote:
> >> >> >> >> >>>>>>>> >
> >> >> >> >> >>>>>>>> > Could this be a backwards-incompatible change that
> would break pipelines from upgrading? If they have data in-flight in
> between operators, and we change the coder, they would break?
> >> >> >> >> >>>>>>>> > I know very little about coders, but since nobody
> has mentioned it, I wanted to make sure we have it in mind.
> >> >> >> >> >>>>>>>> > -P.
> >> >> >> >> >>>>>>>> >
> >> >> >> >> >>>>>>>> > On Wed, Apr 3, 2019 at 8:33 PM Kenneth Knowles <
> kenn@apache.org> wrote:
> >> >> >> >> >>>>>>>> >>
> >> >> >> >> >>>>>>>> >> Agree that a coder URN defines the encoding. I see
> that string UTF-8 was added to the proto enum, but it needs a written spec
> of the encoding. Ideally some test data that different languages can use to
> drive compliance testing.
> >> >> >> >> >>>>>>>> >>
> >> >> >> >> >>>>>>>> >> Kenn
> >> >> >> >> >>>>>>>> >>
> >> >> >> >> >>>>>>>> >> On Wed, Apr 3, 2019 at 6:21 PM Robert Burke <
> robert@frantil.com> wrote:
> >> >> >> >> >>>>>>>> >>>
> >> >> >> >> >>>>>>>> >>> String UTF8 was recently added as a "standard
> coder " URN in the protos, but I don't think that developed beyond Java, so
> adding it to Python would be reasonable in my opinion.
> >> >> >> >> >>>>>>>> >>>
> >> >> >> >> >>>>>>>> >>> The Go SDK handles Strings as "custom coders"
> presently which for Go are always length prefixed (and reported to the
> Runner as LP+CustomCoder). It would be straight forward to add the correct
> handling for strings, as Go natively treats strings as UTF8.
> >> >> >> >> >>>>>>>> >>>
> >> >> >> >> >>>>>>>> >>>
> >> >> >> >> >>>>>>>> >>> On Wed, Apr 3, 2019, 5:03 PM Heejong Lee <
> heejong@google.com> wrote:
> >> >> >> >> >>>>>>>> >>>>
> >> >> >> >> >>>>>>>> >>>> Hi all,
> >> >> >> >> >>>>>>>> >>>>
> >> >> >> >> >>>>>>>> >>>> It looks like UTF-8 String Coder in Java and
> Python SDKs uses different encoding schemes. StringUtf8Coder in Java SDK
> puts the varint length of the input string before actual data bytes however
> StrUtf8Coder in Python SDK directly encodes the input string to bytes
> value. For the last few weeks, I've been testing and fixing cross-language
> IO transforms and this discrepancy is a major blocker for me. IMO, we
> should unify the encoding schemes of UTF8 strings across the different SDKs
> and make it a standard coder. Any thoughts?
> >> >> >> >> >>>>>>>> >>>>
> >> >> >> >> >>>>>>>> >>>> Thanks,
>

Re: [DISCUSS] change the encoding scheme of Python StrUtf8Coder

Posted by Robert Bradshaw <ro...@google.com>.

On Mon, Apr 8, 2019 at 8:04 PM Kenneth Knowles <ke...@apache.org> wrote:
>
> On Mon, Apr 8, 2019 at 1:57 AM Robert Bradshaw <ro...@google.com> wrote:
>>
>> On Sat, Apr 6, 2019 at 12:08 AM Kenneth Knowles <ke...@apache.org> wrote:
>> >
>> > On Fri, Apr 5, 2019 at 2:24 PM Robert Bradshaw <ro...@google.com> wrote:
>> >>
>> >> On Fri, Apr 5, 2019 at 6:24 PM Kenneth Knowles <ke...@apache.org> wrote:
>> >> >
>> >> > Nested and unnested contexts are two different encodings. Can we just give them different URNs? We can even just express the length-prefixed UTF-8 as a composition of the length-prefix URN and the UTF-8 URN.
>> >>
>> >> It's not that simple, especially when it comes to composite encodings.
>> >> E.g. for some coders, nested(C) == unnested(C), for some coders
>> >> nested(C) == lenth_prefix(unnested(C)), and for other coders it's
>> >> something else altogether (e.g. when creating a kv coder, the first
>> >> component must use nested context, and the second inherits the nested
>> >> vs. unnested context). When creating TupleCoder(A, B) one doesn't want
>> >> to forcibly use LenthPrefixCoder(A) and LengthPrefixCoder(B), nor does
>> >> one want to force LengthPrefixCoder(TupleCoder(A, B)) because A and B
>> >> may themselves be large and incrementally written (e.g.
>> >> IterableCoder). Using distinct URNs doesn't work well if the runner is
>> >> free to compose and decompose tuple, iterable, etc. coders that it
>> >> doesn't understand.
>> >>
>> >> Until we stop using Coders for IO (a worthy but probably lofty goal)
>> >> we will continue to need the unnested context (lest we expect and
>> >> produce length-prefixed coders in text files, as bigtable keys, etc.).
>> >> On the other hand, almost all internal use is nested (due to sending
>> >> elements around as part of element streams). The other place we cross
>> >> over is LengthPrefixCoder that encodes its values using the unnested
>> >> context prefixed by the unnested encoding length.
>> >>
>> >> Perhaps a step in the right direction would be to consistently use the
>> >> unnested context everywhere but IOs (meaning when we talked about
>> >> coders from the FnAPI perspective, they're *always* in the nested
>> >> context, and hence always have the one and only encoding defined by
>> >> that URN, including when wrapped by a length prefix coder (which would
>> >> sometimes result in double length prefixing, but I think that's a
>> >> price worth paying (or we could do something more clever like make
>> >> length-prefix an (explicit) modifier on a coder rather than a new
>> >> coder itself that would default to length prefixing (or some of the
>> >> other delimiting schemes we've come up with) but allow the component
>> >> coder to offer alternative length-prefix-compatible encodings))). IOs
>> >> could be updated to take Parsers and Formatters (or whatever we call
>> >> them) with the Coders in the constructors left as syntactic sugar
>> >> until we could remove them in 3.0. As Luke says, we have a chance to
>> >> fix our coders for portable pipelines now.
>> >>
>> >> In the (very) short term, we're stuck with a nested and unnested
>> >> version of StringUtf8, just as we have for bytes, lest we change the
>> >> meaning of (or disallow some of) TupleCoder[StrUtf8Coder, ...],
>> >> LengthPrefixCoder[StrUtf8Coder], and using StringUtf8Coder for IO.
>> >
>> >
>> > First, let's note that "nested" and "outer" are a misnomer. The distinction is whether it is the last thing encoded in the stream. In a KvCoder<KeyCoder, ValueCoder> the ValueCoder is actually encoded in the "outer" context though the value is nested. No doubt a good amount of confusion comes from the initial and continued use of this terminology.
>>
>> +1. I think of these as "self-delimiting" vs. "externally-delimited."
>>
>> > So, all that said, it is a simple fact that UTF-8 and length-prefixed UTF-8 are two different encodings. Encodings are the fundamental concept here and coders encapsulate two encodings, with some subtle and inconsistently-applied rules about when to use which encoding. I think we should still give them distinct URNs unless impossible. You've outlined some steps to clarify the situation.
>>
>> Currently, we have a one-to-two relationship between Coders and
>> encodings, and a one-to-one relationship between URNs and encoders.
>>
>> To make the first one-to-one, we would either have to make
>> StringUtf8Coder unsuitable for TextIO (letting it always prefix its
>> contents with a length) or unsuitable for the key part of a KV, the
>> element of an iterable, etc. (where the length is required).
>
> Definitely in favor of dropping TextIO from the conversation. Just need to think about compatibility, but I would save that for after the idealized end goal is defined, and hack it somehow. to the one-to-one vs one-to-two, can we have:
>
>  - SDK coder -> proto: be smart about knowing what the context is and uses the appropriate URN (this is when building the graph)

So you'd have to day "give me the outer version of yourself" or "give
me the inner version of yourself" which would (sometimes) get passed
down to components. And this extra bit would also have to be set in
the object -> proto caches. It's the same nested/outer logic, just put
in a different place.

The context here is almost always "inner" because even if it's the
PCollection element's coder, it must be encoded in a self-delimiting
way as per the (current) data-plane protocol. On the other hand, if
you have a PCollection<KV> that's later used as a map side input, the
K appears alone (potentiall outer) during lookups. Of course one can
always use inner and not care about the redundancy.

>  - proto -> SDK: instantiate a variant of the coder that is fixed to one context (this is when you get a process bundle instruction from the runner)
>
> I think this is more-or-less the idea you mean by...
>
>> Alternatively we could give Coders the ability to return the
>> nested/unnested version of themselves, but this also gets messy
>> because it depends on the ultimate outer context which we don't always
>> have at hand (and leads to surprises, e.g. asking for the key coder of
>> a KV coder may not return the same coder that was used to construct
>> it).

I was thinking of doing this at graph construction time, but at proto
translation time could be better (though it complicates that).

> ... and I don't think it would be too messy, because fixing the context would happen around submission time, yes? Basically, it keep "Java Coder" as a vestigial concept that compiles away, much like PValue/PInput/POutput.

Unfortunately nested vs. unnested is already a model-level concept
(e.g. Python does the same, and has been changed to match java
exactly). Not that we couldn't change this for FnApi running.

>> On the other hand, breaking the one-to-one relationship of Coders and
>> URNs is also undesirable, because it forces a choice when serializing
>> to protos (where one may not always know all the contexts it's used
>> in)
>
> When would you not know the context? I posit that toProto should always know the context.
>
>> and makes the round-trip through protos non-idempotent (if the URN
>> gets decoded back into a Coder that does not have this dual-defined
>> encoding).
>
> This doesn't seem like such a problem. Of course, I would say that, as I am proposing to do just this :-)

I think we'll have even more of this (happening even at pipeline
construction time, e.g. for cross-language pipelines).

>> Also, if a runner knows about special pairs of URNs that
>> represent the nested/unnested encodings of the same type, and rules
>> like knowing constraints on components of KVs, etc., this seems an
>> even worse situation than we're already in.
>
> Perhaps even worse than this would be to enshrine the idea of self-delimiting/non-self-delimiting coders as a flag in the proto, so runners know to flip it. That is really bad, making the model inherit accidental complexity from the Java SDK. But is this flag always just a length prefix, in practice? If so, then the smarts to add to toProto would cause the self-delimiting variant of UTF-8 coder to be a composite LengthPrefix(UTF-8) coder. Then the model doesn't change and the runner does not need to be aware of special pairs of URNs.

No, most of the "basic" coders (e.g. ints, doubles, iterables, ...)
are already self-delimiting.

>> > I think the most meaningful issue is runners manipulating coders. But the runner should be able to instruct the SDK (where the coder actually executes) about which encoding to use. I need to think through some examples of where the SDK tells the runner the encoding of something and those where the runner fabricates steps and instructs the SDK what encoding to use for elements. GroupByKey, GetKeys, etc, are examples where the context will change.
>>
>> Combiner lifting and side inputs are good candidates as well.
>
>
> GBK example:
>
>  - Input: KvCoder[Outer]<KeyCoder[Inner], ValueCoder[Outer]>
>  - Output: KvCoder[Outer]<KeyCoder[Inner], IterableCoder[Outer]<ValueCoder[Inner]>>
>
> In this case, it is the SDK providing both, so it can easily make the ValueCoder[Inner] a LengthPrefix(ValueCoder[Outer]). If the ValueCoder is already length-prefixed there is a cost. That could be detected and elided I think; might need additional methods but not if the ValueCoder toProto already encoded in a readable way.
>
> GBKVaiGBKO example, where the runner chooses a particular implementation strategy. Assuming the client language is not Java so inner/outer switching cannot be done unless explicitly represented in the proto.
>
>  - Input: KvCoder[Outer]<KeyCoder[Inner], ValueCoder[Outer]>
>  - Intermediate: KvCoder[Outer]<KeyCoder[Inner], IterableCoder[Outer]<WindowedValueCoder[Outer]<ValueCoder[Inner]>>>
>
> Since the input ValueCoder is always Outer, it is easy to add a length-prefix without knowing what it is. So from these examples, I hypothesize that the hard case is when there is a user coder in Inner encoding and the runner wants to turn it into Outer encoding, but the Inner encoding does not explicitly indicate that it is a LengthPrefix(Outer) encoding. I believe that this is not a problem - the runner can just leave it as the Inner encoding since every Inner encoding is a safe Outer encoding (but not vice-versa). I don't know of a real example, but if there were a runner-inserted GetKeys then the KeyCoder[Inner] might *want* to change to KeyCoder[Outer] but that wouldn't be a requirement.
>
> For side inputs, PCollection<ValueCoder[Outer]> the runner may want to put a bunch of elements in some data structure that requires self-delimiting. Same logic, it is always easy to move from Outer to Inner.

Moving an Iterable from Outer to Inner by adding a length-prefix is
really expensive, but useless as Outer is already length-delimited.
Moving a KV coder form Outer to Inner involves changing the second
component. I don't think we want to bake special rules like this into
the model.


This email is already very long, but in summary I think the right
answer is to just get rid of Outer altogether (except possibly for
IOs, which we'd only preserve for legacy reasons until 3.0). The
question remains whether we should allow a Coder C (as part of its
spec, known to all that know the URN of C) to declare an alternative
coder C' for LengthPrefix(C), that is different than e ->
varLen(size(encode(C, e))) + encode(C, e), possibly (often?) C' == C,
to avoid double-length-prefixing.

- Robert


>> >> > On Fri, Apr 5, 2019 at 12:38 AM Robert Bradshaw <ro...@google.com> wrote:
>> >> >>
>> >> >> On Fri, Apr 5, 2019 at 12:50 AM Heejong Lee <he...@google.com> wrote:
>> >> >> >
>> >> >> > Robert, does nested/unnested context work properly for Java?
>> >> >>
>> >> >> I believe so. It is similar to the bytes coder, that prefixes vs. not
>> >> >> based on the context.
>> >> >>
>> >> >> > I can see that the Context is fixed to NESTED[1] and the encode method with the Context parameter is marked as deprecated[2].
>> >> >> >
>> >> >> > [1]: https://github.com/apache/beam/blob/0868e7544fd1e96db67ff5b9e70a67802c0f0c8e/sdks/java/core/src/main/java/org/apache/beam/sdk/coders/StringUtf8Coder.java#L68
>> >> >> > [2]: https://github.com/apache/beam/blob/0868e7544fd1e96db67ff5b9e70a67802c0f0c8e/sdks/java/core/src/main/java/org/apache/beam/sdk/coders/Coder.java#L132
>> >> >>
>> >> >> That doesn't mean it's unused, e.g.
>> >> >>
>> >> >> https://github.com/apache/beam/blob/release-2.12.0/sdks/java/core/src/main/java/org/apache/beam/sdk/util/CoderUtils.java#L160
>> >> >> https://github.com/apache/beam/blob/release-2.12.0/sdks/java/core/src/main/java/org/apache/beam/sdk/coders/LengthPrefixCoder.java#L64
>> >> >>
>> >> >> (and I'm sure there's others).
>> >> >>
>> >> >> > On Thu, Apr 4, 2019 at 3:25 PM Robert Bradshaw <ro...@google.com> wrote:
>> >> >> >>
>> >> >> >> I don't know why there are two separate copies of
>> >> >> >> standard_coders.yaml--originally there was just one (though it did
>> >> >> >> live in the Python directory). I'm guessing a copy was made rather
>> >> >> >> than just pointing both to the new location, but that completely
>> >> >> >> defeats the point. I can't seem to access JIRA right now; could
>> >> >> >> someone file an issue to resolve this?
>> >> >> >>
>> >> >> >> I also think the spec should be next to the definition of the URN,
>> >> >> >> that's one of the reason the URNs were originally in a markdown file
>> >> >> >> (to encourage good documentation, literate programming style). Many
>> >> >> >> coders already have their specs there.
>> >> >> >>
>> >> >> >> Regarding backwards compatibility, we can't change existing coders,
>> >> >> >> and making new coders won't help with inference ('cause changing that
>> >> >> >> would also be backwards incompatible). Fortunately, I think we're
>> >> >> >> already doing the consistent thing here: In both Python and Java the
>> >> >> >> raw UTF-8 encoded bytes are encoded when used in an *unnested* context
>> >> >> >> and the length-prefixed UTF-8 encoded bytes are used when the coder is
>> >> >> >> used in a *nested* context.
>> >> >> >>
>> >> >> >> I'd really like to see the whole nested/unnested context go away, but
>> >> >> >> that'll probably require Beam 3.0; it causes way more confusion than
>> >> >> >> the couple of bytes it saves in a couple of places.
>> >> >> >>
>> >> >> >> - Robert
>> >> >> >>
>> >> >> >> On Thu, Apr 4, 2019 at 10:55 PM Robert Burke <ro...@frantil.com> wrote:
>> >> >> >> >
>> >> >> >> > My 2cents is that the "Textual description" should be part of the documentation of the URNs on the Proto messages, since that's the common place. I've added a short description for the varints for example, and we already have lenghthier format & protocol descriptions there for iterables and similar.
>> >> >> >> >
>> >> >> >> > The proto [1] *can be* the spec if we want it to be.
>> >> >> >> >
>> >> >> >> > [1]: https://github.com/apache/beam/blob/069fc3de95bd96f34c363308ad9ba988ab58502d/model/pipeline/src/main/proto/beam_runner_api.proto#L557
>> >> >> >> >
>> >> >> >> > On Thu, 4 Apr 2019 at 13:51, Kenneth Knowles <ke...@apache.org> wrote:
>> >> >> >> >>
>> >> >> >> >>
>> >> >> >> >>
>> >> >> >> >> On Thu, Apr 4, 2019 at 1:49 PM Robert Burke <ro...@frantil.com> wrote:
>> >> >> >> >>>
>> >> >> >> >>> We should probably move the "java" version of the yaml file [1] to a common location rather than deep in the java hierarchy, or copying it for Go and Python, but that can be a separate task. It's probably non-trivial since it looks like it's part of a java resources structure.
>> >> >> >> >>
>> >> >> >> >>
>> >> >> >> >> Seems like /model is a good place for this if we don't want to invent a new language-independent hierarchy.
>> >> >> >> >>
>> >> >> >> >> Kenn
>> >> >> >> >>
>> >> >> >> >>
>> >> >> >> >>>
>> >> >> >> >>> Luke, the Go SDK doesn't currently do this validation, but it shouldn't be difficult, given pointers to the Java and Python variants of the tests to crib from [2]. Care would need to be taken so that Beam Go SDK users (such as they are) aren't forced to run them, and not have the yaml file to read. I'd suggest putting it with the integration tests [3].
>> >> >> >> >>>
>> >> >> >> >>> I've filed a JIRA (BEAM-7009) for tracking this Go SDK side work. [4]
>> >> >> >> >>>
>> >> >> >> >>> 1: https://github.com/apache/beam/blob/master/model/fn-execution/src/main/resources/org/apache/beam/model/fnexecution/v1/standard_coders.yaml
>> >> >> >> >>> 2: https://github.com/apache/beam/search?q=standard_coders.yaml&unscoped_q=standard_coders.yaml
>> >> >> >> >>> 3: https://github.com/apache/beam/tree/master/sdks/go/test
>> >> >> >> >>> 4: https://issues.apache.org/jira/browse/BEAM-7009
>> >> >> >> >>>
>> >> >> >> >>>
>> >> >> >> >>> On Thu, 4 Apr 2019 at 13:28, Lukasz Cwik <lc...@google.com> wrote:
>> >> >> >> >>>>
>> >> >> >> >>>>
>> >> >> >> >>>>
>> >> >> >> >>>> On Thu, Apr 4, 2019 at 1:15 PM Chamikara Jayalath <ch...@google.com> wrote:
>> >> >> >> >>>>>
>> >> >> >> >>>>>
>> >> >> >> >>>>>
>> >> >> >> >>>>> On Thu, Apr 4, 2019 at 12:15 PM Lukasz Cwik <lc...@google.com> wrote:
>> >> >> >> >>>>>>
>> >> >> >> >>>>>> standard_coders.yaml[1] is where we are currently defining these formats.
>> >> >> >> >>>>>> Unfortunately the Python SDK has its own copy[2].
>> >> >> >> >>>>>
>> >> >> >> >>>>>
>> >> >> >> >>>>> Ah great. Thanks for the pointer. Any idea why there's  a separate copy for Python ? I didn't see a significant difference in definitions looking at few random coders there but I might have missed something. If there's no reason to maintain two, we should probably unify.
>> >> >> >> >>>>> Also, seems like we haven't added the definition for UTF-8 coder yet.
>> >> >> >> >>>>>
>> >> >> >> >>>>
>> >> >> >> >>>> Not certain as well. I did notice the timer coder definition didn't exist in the Python copy.
>> >> >> >> >>>>
>> >> >> >> >>>>>>
>> >> >> >> >>>>>>
>> >> >> >> >>>>>> Here is an example PR[3] that adds the "beam:coder:double:v1" as tests to the Java and Python SDKs to ensure interoperability.
>> >> >> >> >>>>>>
>> >> >> >> >>>>>> Robert Burke, does the Go SDK have a test where it uses standard_coders.yaml and runs compatibility tests?
>> >> >> >> >>>>>>
>> >> >> >> >>>>>> Chamikara, creating new coder classes is a pain since the type -> coder mapping per SDK language would select the non-well known type if we added a new one to a language. If we swapped the default type->coder mapping, this would still break update for pipelines forcing users to update their code to select the non-well known type. If we don't change the default type->coder mapping, the well known coder will gain little usage. I think we should fix the Python coder to use the same encoding as Java for UTF-8 strings before there are too many Python SDK users.
>> >> >> >> >>>>>
>> >> >> >> >>>>>
>> >> >> >> >>>>> I was thinking that may be we should just change the default UTF-8 coder for Fn API path which is experimental. Updating Python to do what's done for Java is fine if we agree that encoding used for Java should be the standard.
>> >> >> >> >>>>>
>> >> >> >> >>>>
>> >> >> >> >>>> That is a good idea to use the Fn API experiment to control which gets selected.
>> >> >> >> >>>>
>> >> >> >> >>>>>>
>> >> >> >> >>>>>>
>> >> >> >> >>>>>> 1: https://github.com/apache/beam/blob/master/model/fn-execution/src/main/resources/org/apache/beam/model/fnexecution/v1/standard_coders.yaml
>> >> >> >> >>>>>> 2: https://github.com/apache/beam/blob/master/sdks/python/apache_beam/testing/data/standard_coders.yaml
>> >> >> >> >>>>>> 3: https://github.com/apache/beam/pull/8205
>> >> >> >> >>>>>>
>> >> >> >> >>>>>> On Thu, Apr 4, 2019 at 11:50 AM Chamikara Jayalath <ch...@google.com> wrote:
>> >> >> >> >>>>>>>
>> >> >> >> >>>>>>>
>> >> >> >> >>>>>>>
>> >> >> >> >>>>>>> On Thu, Apr 4, 2019 at 11:29 AM Robert Bradshaw <ro...@google.com> wrote:
>> >> >> >> >>>>>>>>
>> >> >> >> >>>>>>>> A URN defines the encoding.
>> >> >> >> >>>>>>>>
>> >> >> >> >>>>>>>> There are (unfortunately) *two* encodings defined for a Coder (defined
>> >> >> >> >>>>>>>> by a URN), the nested and the unnested one. IIRC, in both Java and
>> >> >> >> >>>>>>>> Python, the nested one prefixes with a var-int length, and the
>> >> >> >> >>>>>>>> unnested one does not.
>> >> >> >> >>>>>>>
>> >> >> >> >>>>>>>
>> >> >> >> >>>>>>> Could you clarify where we define the exact encoding ? I only see a URN for UTF-8 [1] while if you look at the implementations Java includes length in the encoding [1] while Python [1] does not.
>> >> >> >> >>>>>>>
>> >> >> >> >>>>>>> [1] https://github.com/apache/beam/blob/069fc3de95bd96f34c363308ad9ba988ab58502d/model/pipeline/src/main/proto/beam_runner_api.proto#L563
>> >> >> >> >>>>>>> [2] https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/coders/StringUtf8Coder.java#L50
>> >> >> >> >>>>>>> [3] https://github.com/apache/beam/blob/master/sdks/python/apache_beam/coders/coders.py#L321
>> >> >> >> >>>>>>>
>> >> >> >> >>>>>>>
>> >> >> >> >>>>>>>>
>> >> >> >> >>>>>>>>
>> >> >> >> >>>>>>>> We should define the spec clearly and have cross-language tests.
>> >> >> >> >>>>>>>
>> >> >> >> >>>>>>>
>> >> >> >> >>>>>>> +1
>> >> >> >> >>>>>>>
>> >> >> >> >>>>>>> Regarding backwards compatibility, I agree that we should probably not update existing coder classes. Probably we should just standardize the correct encoding (may be as a comment near corresponding URN in the beam_runner_api.proto ?) and create new coder classes as needed.
>> >> >> >> >>>>>>>
>> >> >> >> >>>>>>>>
>> >> >> >> >>>>>>>>
>> >> >> >> >>>>>>>> On Thu, Apr 4, 2019 at 8:13 PM Pablo Estrada <pa...@google.com> wrote:
>> >> >> >> >>>>>>>> >
>> >> >> >> >>>>>>>> > Could this be a backwards-incompatible change that would break pipelines from upgrading? If they have data in-flight in between operators, and we change the coder, they would break?
>> >> >> >> >>>>>>>> > I know very little about coders, but since nobody has mentioned it, I wanted to make sure we have it in mind.
>> >> >> >> >>>>>>>> > -P.
>> >> >> >> >>>>>>>> >
>> >> >> >> >>>>>>>> > On Wed, Apr 3, 2019 at 8:33 PM Kenneth Knowles <ke...@apache.org> wrote:
>> >> >> >> >>>>>>>> >>
>> >> >> >> >>>>>>>> >> Agree that a coder URN defines the encoding. I see that string UTF-8 was added to the proto enum, but it needs a written spec of the encoding. Ideally some test data that different languages can use to drive compliance testing.
>> >> >> >> >>>>>>>> >>
>> >> >> >> >>>>>>>> >> Kenn
>> >> >> >> >>>>>>>> >>
>> >> >> >> >>>>>>>> >> On Wed, Apr 3, 2019 at 6:21 PM Robert Burke <ro...@frantil.com> wrote:
>> >> >> >> >>>>>>>> >>>
>> >> >> >> >>>>>>>> >>> String UTF8 was recently added as a "standard coder " URN in the protos, but I don't think that developed beyond Java, so adding it to Python would be reasonable in my opinion.
>> >> >> >> >>>>>>>> >>>
>> >> >> >> >>>>>>>> >>> The Go SDK handles Strings as "custom coders" presently which for Go are always length prefixed (and reported to the Runner as LP+CustomCoder). It would be straight forward to add the correct handling for strings, as Go natively treats strings as UTF8.
>> >> >> >> >>>>>>>> >>>
>> >> >> >> >>>>>>>> >>>
>> >> >> >> >>>>>>>> >>> On Wed, Apr 3, 2019, 5:03 PM Heejong Lee <he...@google.com> wrote:
>> >> >> >> >>>>>>>> >>>>
>> >> >> >> >>>>>>>> >>>> Hi all,
>> >> >> >> >>>>>>>> >>>>
>> >> >> >> >>>>>>>> >>>> It looks like UTF-8 String Coder in Java and Python SDKs uses different encoding schemes. StringUtf8Coder in Java SDK puts the varint length of the input string before actual data bytes however StrUtf8Coder in Python SDK directly encodes the input string to bytes value. For the last few weeks, I've been testing and fixing cross-language IO transforms and this discrepancy is a major blocker for me. IMO, we should unify the encoding schemes of UTF8 strings across the different SDKs and make it a standard coder. Any thoughts?
>> >> >> >> >>>>>>>> >>>>
>> >> >> >> >>>>>>>> >>>> Thanks,

Re: [DISCUSS] change the encoding scheme of Python StrUtf8Coder

Posted by Kenneth Knowles <ke...@apache.org>.

On Mon, Apr 8, 2019 at 1:57 AM Robert Bradshaw <ro...@google.com> wrote:

> On Sat, Apr 6, 2019 at 12:08 AM Kenneth Knowles <ke...@apache.org> wrote:
> >
> >
> >
> > On Fri, Apr 5, 2019 at 2:24 PM Robert Bradshaw <ro...@google.com>
> wrote:
> >>
> >> On Fri, Apr 5, 2019 at 6:24 PM Kenneth Knowles <ke...@apache.org> wrote:
> >> >
> >> > Nested and unnested contexts are two different encodings. Can we just
> give them different URNs? We can even just express the length-prefixed
> UTF-8 as a composition of the length-prefix URN and the UTF-8 URN.
> >>
> >> It's not that simple, especially when it comes to composite encodings.
> >> E.g. for some coders, nested(C) == unnested(C), for some coders
> >> nested(C) == lenth_prefix(unnested(C)), and for other coders it's
> >> something else altogether (e.g. when creating a kv coder, the first
> >> component must use nested context, and the second inherits the nested
> >> vs. unnested context). When creating TupleCoder(A, B) one doesn't want
> >> to forcibly use LenthPrefixCoder(A) and LengthPrefixCoder(B), nor does
> >> one want to force LengthPrefixCoder(TupleCoder(A, B)) because A and B
> >> may themselves be large and incrementally written (e.g.
> >> IterableCoder). Using distinct URNs doesn't work well if the runner is
> >> free to compose and decompose tuple, iterable, etc. coders that it
> >> doesn't understand.
> >>
> >> Until we stop using Coders for IO (a worthy but probably lofty goal)
> >> we will continue to need the unnested context (lest we expect and
> >> produce length-prefixed coders in text files, as bigtable keys, etc.).
> >> On the other hand, almost all internal use is nested (due to sending
> >> elements around as part of element streams). The other place we cross
> >> over is LengthPrefixCoder that encodes its values using the unnested
> >> context prefixed by the unnested encoding length.
> >>
> >> Perhaps a step in the right direction would be to consistently use the
> >> unnested context everywhere but IOs (meaning when we talked about
> >> coders from the FnAPI perspective, they're *always* in the nested
> >> context, and hence always have the one and only encoding defined by
> >> that URN, including when wrapped by a length prefix coder (which would
> >> sometimes result in double length prefixing, but I think that's a
> >> price worth paying (or we could do something more clever like make
> >> length-prefix an (explicit) modifier on a coder rather than a new
> >> coder itself that would default to length prefixing (or some of the
> >> other delimiting schemes we've come up with) but allow the component
> >> coder to offer alternative length-prefix-compatible encodings))). IOs
> >> could be updated to take Parsers and Formatters (or whatever we call
> >> them) with the Coders in the constructors left as syntactic sugar
> >> until we could remove them in 3.0. As Luke says, we have a chance to
> >> fix our coders for portable pipelines now.
> >>
> >> In the (very) short term, we're stuck with a nested and unnested
> >> version of StringUtf8, just as we have for bytes, lest we change the
> >> meaning of (or disallow some of) TupleCoder[StrUtf8Coder, ...],
> >> LengthPrefixCoder[StrUtf8Coder], and using StringUtf8Coder for IO.
> >
> >
> > First, let's note that "nested" and "outer" are a misnomer. The
> distinction is whether it is the last thing encoded in the stream. In a
> KvCoder<KeyCoder, ValueCoder> the ValueCoder is actually encoded in the
> "outer" context though the value is nested. No doubt a good amount of
> confusion comes from the initial and continued use of this terminology.
>
> +1. I think of these as "self-delimiting" vs. "externally-delimited."
>
> > So, all that said, it is a simple fact that UTF-8 and length-prefixed
> UTF-8 are two different encodings. Encodings are the fundamental concept
> here and coders encapsulate two encodings, with some subtle and
> inconsistently-applied rules about when to use which encoding. I think we
> should still give them distinct URNs unless impossible. You've outlined
> some steps to clarify the situation.
>
> Currently, we have a one-to-two relationship between Coders and
> encodings, and a one-to-one relationship between URNs and encoders.
>
> To make the first one-to-one, we would either have to make
> StringUtf8Coder unsuitable for TextIO (letting it always prefix its
> contents with a length) or unsuitable for the key part of a KV, the
> element of an iterable, etc. (where the length is required).
>

Definitely in favor of dropping TextIO from the conversation. Just need to
think about compatibility, but I would save that for after the idealized
end goal is defined, and hack it somehow. to the one-to-one vs one-to-two,
can we have:

 - SDK coder -> proto: be smart about knowing what the context is and uses
the appropriate URN (this is when building the graph)
 - proto -> SDK: instantiate a variant of the coder that is fixed to one
context (this is when you get a process bundle instruction from the runner)

I think this is more-or-less the idea you mean by...

Alternatively we could give Coders the ability to return the
> nested/unnested version of themselves, but this also gets messy
> because it depends on the ultimate outer context which we don't always
> have at hand (and leads to surprises, e.g. asking for the key coder of
> a KV coder may not return the same coder that was used to construct
> it).
>

... and I don't think it would be too messy, because fixing the context
would happen around submission time, yes? Basically, it keep "Java Coder"
as a vestigial concept that compiles away, much like PValue/PInput/POutput.


> On the other hand, breaking the one-to-one relationship of Coders and
> URNs is also undesirable, because it forces a choice when serializing
> to protos (where one may not always know all the contexts it's used
> in)


When would you not know the context? I posit that toProto should always
know the context.


> and makes the round-trip through protos non-idempotent (if the URN
> gets decoded back into a Coder that does not have this dual-defined
> encoding).


This doesn't seem like such a problem. Of course, I would say that, as I am
proposing to do just this :-)

Also, if a runner knows about special pairs of URNs that
> represent the nested/unnested encodings of the same type, and rules
> like knowing constraints on components of KVs, etc., this seems an
> even worse situation than we're already in.
>

Perhaps even worse than this would be to enshrine the idea of
self-delimiting/non-self-delimiting coders as a flag in the proto, so
runners know to flip it. That is really bad, making the model inherit
accidental complexity from the Java SDK. But is this flag always just a
length prefix, in practice? If so, then the smarts to add to toProto would
cause the self-delimiting variant of UTF-8 coder to be a composite
LengthPrefix(UTF-8) coder. Then the model doesn't change and the runner
does not need to be aware of special pairs of URNs.


> > I think the most meaningful issue is runners manipulating coders. But
> the runner should be able to instruct the SDK (where the coder actually
> executes) about which encoding to use. I need to think through some
> examples of where the SDK tells the runner the encoding of something and
> those where the runner fabricates steps and instructs the SDK what encoding
> to use for elements. GroupByKey, GetKeys, etc, are examples where the
> context will change.
>
> Combiner lifting and side inputs are good candidates as well.
>

GBK example:

 - Input: KvCoder[Outer]<KeyCoder[Inner], ValueCoder[Outer]>
 - Output: KvCoder[Outer]<KeyCoder[Inner],
IterableCoder[Outer]<ValueCoder[Inner]>>

In this case, it is the SDK providing both, so it can easily make the
ValueCoder[Inner] a LengthPrefix(ValueCoder[Outer]). If the ValueCoder is
already length-prefixed there is a cost. That could be detected and elided
I think; might need additional methods but not if the ValueCoder toProto
already encoded in a readable way.

GBKVaiGBKO example, where the runner chooses a particular implementation
strategy. Assuming the client language is not Java so inner/outer switching
cannot be done unless explicitly represented in the proto.

 - Input: KvCoder[Outer]<KeyCoder[Inner], ValueCoder[Outer]>
 - Intermediate: KvCoder[Outer]<KeyCoder[Inner],
IterableCoder[Outer]<WindowedValueCoder[Outer]<ValueCoder[Inner]>>>

Since the input ValueCoder is always Outer, it is easy to add a
length-prefix without knowing what it is. So from these examples, I
hypothesize that the hard case is when there is a user coder in Inner
encoding and the runner wants to turn it into Outer encoding, but the Inner
encoding does not explicitly indicate that it is a LengthPrefix(Outer)
encoding. I believe that this is not a problem - the runner can just leave
it as the Inner encoding since every Inner encoding is a safe Outer
encoding (but not vice-versa). I don't know of a real example, but if there
were a runner-inserted GetKeys then the KeyCoder[Inner] might *want* to
change to KeyCoder[Outer] but that wouldn't be a requirement.

For side inputs, PCollection<ValueCoder[Outer]> the runner may want to put
a bunch of elements in some data structure that requires self-delimiting.
Same logic, it is always easy to move from Outer to Inner.

Kenn


> >> > On Fri, Apr 5, 2019 at 12:38 AM Robert Bradshaw <ro...@google.com>
> wrote:
> >> >>
> >> >> On Fri, Apr 5, 2019 at 12:50 AM Heejong Lee <he...@google.com>
> wrote:
> >> >> >
> >> >> > Robert, does nested/unnested context work properly for Java?
> >> >>
> >> >> I believe so. It is similar to the bytes coder, that prefixes vs. not
> >> >> based on the context.
> >> >>
> >> >> > I can see that the Context is fixed to NESTED[1] and the encode
> method with the Context parameter is marked as deprecated[2].
> >> >> >
> >> >> > [1]:
> https://github.com/apache/beam/blob/0868e7544fd1e96db67ff5b9e70a67802c0f0c8e/sdks/java/core/src/main/java/org/apache/beam/sdk/coders/StringUtf8Coder.java#L68
> >> >> > [2]:
> https://github.com/apache/beam/blob/0868e7544fd1e96db67ff5b9e70a67802c0f0c8e/sdks/java/core/src/main/java/org/apache/beam/sdk/coders/Coder.java#L132
> >> >>
> >> >> That doesn't mean it's unused, e.g.
> >> >>
> >> >>
> https://github.com/apache/beam/blob/release-2.12.0/sdks/java/core/src/main/java/org/apache/beam/sdk/util/CoderUtils.java#L160
> >> >>
> https://github.com/apache/beam/blob/release-2.12.0/sdks/java/core/src/main/java/org/apache/beam/sdk/coders/LengthPrefixCoder.java#L64
> >> >>
> >> >> (and I'm sure there's others).
> >> >>
> >> >> > On Thu, Apr 4, 2019 at 3:25 PM Robert Bradshaw <
> robertwb@google.com> wrote:
> >> >> >>
> >> >> >> I don't know why there are two separate copies of
> >> >> >> standard_coders.yaml--originally there was just one (though it did
> >> >> >> live in the Python directory). I'm guessing a copy was made rather
> >> >> >> than just pointing both to the new location, but that completely
> >> >> >> defeats the point. I can't seem to access JIRA right now; could
> >> >> >> someone file an issue to resolve this?
> >> >> >>
> >> >> >> I also think the spec should be next to the definition of the URN,
> >> >> >> that's one of the reason the URNs were originally in a markdown
> file
> >> >> >> (to encourage good documentation, literate programming style).
> Many
> >> >> >> coders already have their specs there.
> >> >> >>
> >> >> >> Regarding backwards compatibility, we can't change existing
> coders,
> >> >> >> and making new coders won't help with inference ('cause changing
> that
> >> >> >> would also be backwards incompatible). Fortunately, I think we're
> >> >> >> already doing the consistent thing here: In both Python and Java
> the
> >> >> >> raw UTF-8 encoded bytes are encoded when used in an *unnested*
> context
> >> >> >> and the length-prefixed UTF-8 encoded bytes are used when the
> coder is
> >> >> >> used in a *nested* context.
> >> >> >>
> >> >> >> I'd really like to see the whole nested/unnested context go away,
> but
> >> >> >> that'll probably require Beam 3.0; it causes way more confusion
> than
> >> >> >> the couple of bytes it saves in a couple of places.
> >> >> >>
> >> >> >> - Robert
> >> >> >>
> >> >> >> On Thu, Apr 4, 2019 at 10:55 PM Robert Burke <ro...@frantil.com>
> wrote:
> >> >> >> >
> >> >> >> > My 2cents is that the "Textual description" should be part of
> the documentation of the URNs on the Proto messages, since that's the
> common place. I've added a short description for the varints for example,
> and we already have lenghthier format & protocol descriptions there for
> iterables and similar.
> >> >> >> >
> >> >> >> > The proto [1] *can be* the spec if we want it to be.
> >> >> >> >
> >> >> >> > [1]:
> https://github.com/apache/beam/blob/069fc3de95bd96f34c363308ad9ba988ab58502d/model/pipeline/src/main/proto/beam_runner_api.proto#L557
> >> >> >> >
> >> >> >> > On Thu, 4 Apr 2019 at 13:51, Kenneth Knowles <ke...@apache.org>
> wrote:
> >> >> >> >>
> >> >> >> >>
> >> >> >> >>
> >> >> >> >> On Thu, Apr 4, 2019 at 1:49 PM Robert Burke <
> robert@frantil.com> wrote:
> >> >> >> >>>
> >> >> >> >>> We should probably move the "java" version of the yaml file
> [1] to a common location rather than deep in the java hierarchy, or copying
> it for Go and Python, but that can be a separate task. It's probably
> non-trivial since it looks like it's part of a java resources structure.
> >> >> >> >>
> >> >> >> >>
> >> >> >> >> Seems like /model is a good place for this if we don't want to
> invent a new language-independent hierarchy.
> >> >> >> >>
> >> >> >> >> Kenn
> >> >> >> >>
> >> >> >> >>
> >> >> >> >>>
> >> >> >> >>> Luke, the Go SDK doesn't currently do this validation, but it
> shouldn't be difficult, given pointers to the Java and Python variants of
> the tests to crib from [2]. Care would need to be taken so that Beam Go SDK
> users (such as they are) aren't forced to run them, and not have the yaml
> file to read. I'd suggest putting it with the integration tests [3].
> >> >> >> >>>
> >> >> >> >>> I've filed a JIRA (BEAM-7009) for tracking this Go SDK side
> work. [4]
> >> >> >> >>>
> >> >> >> >>> 1:
> https://github.com/apache/beam/blob/master/model/fn-execution/src/main/resources/org/apache/beam/model/fnexecution/v1/standard_coders.yaml
> >> >> >> >>> 2:
> https://github.com/apache/beam/search?q=standard_coders.yaml&unscoped_q=standard_coders.yaml
> >> >> >> >>> 3: https://github.com/apache/beam/tree/master/sdks/go/test
> >> >> >> >>> 4: https://issues.apache.org/jira/browse/BEAM-7009
> >> >> >> >>>
> >> >> >> >>>
> >> >> >> >>> On Thu, 4 Apr 2019 at 13:28, Lukasz Cwik <lc...@google.com>
> wrote:
> >> >> >> >>>>
> >> >> >> >>>>
> >> >> >> >>>>
> >> >> >> >>>> On Thu, Apr 4, 2019 at 1:15 PM Chamikara Jayalath <
> chamikara@google.com> wrote:
> >> >> >> >>>>>
> >> >> >> >>>>>
> >> >> >> >>>>>
> >> >> >> >>>>> On Thu, Apr 4, 2019 at 12:15 PM Lukasz Cwik <
> lcwik@google.com> wrote:
> >> >> >> >>>>>>
> >> >> >> >>>>>> standard_coders.yaml[1] is where we are currently defining
> these formats.
> >> >> >> >>>>>> Unfortunately the Python SDK has its own copy[2].
> >> >> >> >>>>>
> >> >> >> >>>>>
> >> >> >> >>>>> Ah great. Thanks for the pointer. Any idea why there's  a
> separate copy for Python ? I didn't see a significant difference in
> definitions looking at few random coders there but I might have missed
> something. If there's no reason to maintain two, we should probably unify.
> >> >> >> >>>>> Also, seems like we haven't added the definition for UTF-8
> coder yet.
> >> >> >> >>>>>
> >> >> >> >>>>
> >> >> >> >>>> Not certain as well. I did notice the timer coder definition
> didn't exist in the Python copy.
> >> >> >> >>>>
> >> >> >> >>>>>>
> >> >> >> >>>>>>
> >> >> >> >>>>>> Here is an example PR[3] that adds the
> "beam:coder:double:v1" as tests to the Java and Python SDKs to ensure
> interoperability.
> >> >> >> >>>>>>
> >> >> >> >>>>>> Robert Burke, does the Go SDK have a test where it uses
> standard_coders.yaml and runs compatibility tests?
> >> >> >> >>>>>>
> >> >> >> >>>>>> Chamikara, creating new coder classes is a pain since the
> type -> coder mapping per SDK language would select the non-well known type
> if we added a new one to a language. If we swapped the default type->coder
> mapping, this would still break update for pipelines forcing users to
> update their code to select the non-well known type. If we don't change the
> default type->coder mapping, the well known coder will gain little usage. I
> think we should fix the Python coder to use the same encoding as Java for
> UTF-8 strings before there are too many Python SDK users.
> >> >> >> >>>>>
> >> >> >> >>>>>
> >> >> >> >>>>> I was thinking that may be we should just change the
> default UTF-8 coder for Fn API path which is experimental. Updating Python
> to do what's done for Java is fine if we agree that encoding used for Java
> should be the standard.
> >> >> >> >>>>>
> >> >> >> >>>>
> >> >> >> >>>> That is a good idea to use the Fn API experiment to control
> which gets selected.
> >> >> >> >>>>
> >> >> >> >>>>>>
> >> >> >> >>>>>>
> >> >> >> >>>>>> 1:
> https://github.com/apache/beam/blob/master/model/fn-execution/src/main/resources/org/apache/beam/model/fnexecution/v1/standard_coders.yaml
> >> >> >> >>>>>> 2:
> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/testing/data/standard_coders.yaml
> >> >> >> >>>>>> 3: https://github.com/apache/beam/pull/8205
> >> >> >> >>>>>>
> >> >> >> >>>>>> On Thu, Apr 4, 2019 at 11:50 AM Chamikara Jayalath <
> chamikara@google.com> wrote:
> >> >> >> >>>>>>>
> >> >> >> >>>>>>>
> >> >> >> >>>>>>>
> >> >> >> >>>>>>> On Thu, Apr 4, 2019 at 11:29 AM Robert Bradshaw <
> robertwb@google.com> wrote:
> >> >> >> >>>>>>>>
> >> >> >> >>>>>>>> A URN defines the encoding.
> >> >> >> >>>>>>>>
> >> >> >> >>>>>>>> There are (unfortunately) *two* encodings defined for a
> Coder (defined
> >> >> >> >>>>>>>> by a URN), the nested and the unnested one. IIRC, in
> both Java and
> >> >> >> >>>>>>>> Python, the nested one prefixes with a var-int length,
> and the
> >> >> >> >>>>>>>> unnested one does not.
> >> >> >> >>>>>>>
> >> >> >> >>>>>>>
> >> >> >> >>>>>>> Could you clarify where we define the exact encoding ? I
> only see a URN for UTF-8 [1] while if you look at the implementations Java
> includes length in the encoding [1] while Python [1] does not.
> >> >> >> >>>>>>>
> >> >> >> >>>>>>> [1]
> https://github.com/apache/beam/blob/069fc3de95bd96f34c363308ad9ba988ab58502d/model/pipeline/src/main/proto/beam_runner_api.proto#L563
> >> >> >> >>>>>>> [2]
> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/coders/StringUtf8Coder.java#L50
> >> >> >> >>>>>>> [3]
> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/coders/coders.py#L321
> >> >> >> >>>>>>>
> >> >> >> >>>>>>>
> >> >> >> >>>>>>>>
> >> >> >> >>>>>>>>
> >> >> >> >>>>>>>> We should define the spec clearly and have
> cross-language tests.
> >> >> >> >>>>>>>
> >> >> >> >>>>>>>
> >> >> >> >>>>>>> +1
> >> >> >> >>>>>>>
> >> >> >> >>>>>>> Regarding backwards compatibility, I agree that we should
> probably not update existing coder classes. Probably we should just
> standardize the correct encoding (may be as a comment near corresponding
> URN in the beam_runner_api.proto ?) and create new coder classes as needed.
> >> >> >> >>>>>>>
> >> >> >> >>>>>>>>
> >> >> >> >>>>>>>>
> >> >> >> >>>>>>>> On Thu, Apr 4, 2019 at 8:13 PM Pablo Estrada <
> pabloem@google.com> wrote:
> >> >> >> >>>>>>>> >
> >> >> >> >>>>>>>> > Could this be a backwards-incompatible change that
> would break pipelines from upgrading? If they have data in-flight in
> between operators, and we change the coder, they would break?
> >> >> >> >>>>>>>> > I know very little about coders, but since nobody has
> mentioned it, I wanted to make sure we have it in mind.
> >> >> >> >>>>>>>> > -P.
> >> >> >> >>>>>>>> >
> >> >> >> >>>>>>>> > On Wed, Apr 3, 2019 at 8:33 PM Kenneth Knowles <
> kenn@apache.org> wrote:
> >> >> >> >>>>>>>> >>
> >> >> >> >>>>>>>> >> Agree that a coder URN defines the encoding. I see
> that string UTF-8 was added to the proto enum, but it needs a written spec
> of the encoding. Ideally some test data that different languages can use to
> drive compliance testing.
> >> >> >> >>>>>>>> >>
> >> >> >> >>>>>>>> >> Kenn
> >> >> >> >>>>>>>> >>
> >> >> >> >>>>>>>> >> On Wed, Apr 3, 2019 at 6:21 PM Robert Burke <
> robert@frantil.com> wrote:
> >> >> >> >>>>>>>> >>>
> >> >> >> >>>>>>>> >>> String UTF8 was recently added as a "standard coder
> " URN in the protos, but I don't think that developed beyond Java, so
> adding it to Python would be reasonable in my opinion.
> >> >> >> >>>>>>>> >>>
> >> >> >> >>>>>>>> >>> The Go SDK handles Strings as "custom coders"
> presently which for Go are always length prefixed (and reported to the
> Runner as LP+CustomCoder). It would be straight forward to add the correct
> handling for strings, as Go natively treats strings as UTF8.
> >> >> >> >>>>>>>> >>>
> >> >> >> >>>>>>>> >>>
> >> >> >> >>>>>>>> >>> On Wed, Apr 3, 2019, 5:03 PM Heejong Lee <
> heejong@google.com> wrote:
> >> >> >> >>>>>>>> >>>>
> >> >> >> >>>>>>>> >>>> Hi all,
> >> >> >> >>>>>>>> >>>>
> >> >> >> >>>>>>>> >>>> It looks like UTF-8 String Coder in Java and Python
> SDKs uses different encoding schemes. StringUtf8Coder in Java SDK puts the
> varint length of the input string before actual data bytes however
> StrUtf8Coder in Python SDK directly encodes the input string to bytes
> value. For the last few weeks, I've been testing and fixing cross-language
> IO transforms and this discrepancy is a major blocker for me. IMO, we
> should unify the encoding schemes of UTF8 strings across the different SDKs
> and make it a standard coder. Any thoughts?
> >> >> >> >>>>>>>> >>>>
> >> >> >> >>>>>>>> >>>> Thanks,
>

Re: [DISCUSS] change the encoding scheme of Python StrUtf8Coder

Posted by Robert Bradshaw <ro...@google.com>.

On Sat, Apr 6, 2019 at 12:08 AM Kenneth Knowles <ke...@apache.org> wrote:
>
>
>
> On Fri, Apr 5, 2019 at 2:24 PM Robert Bradshaw <ro...@google.com> wrote:
>>
>> On Fri, Apr 5, 2019 at 6:24 PM Kenneth Knowles <ke...@apache.org> wrote:
>> >
>> > Nested and unnested contexts are two different encodings. Can we just give them different URNs? We can even just express the length-prefixed UTF-8 as a composition of the length-prefix URN and the UTF-8 URN.
>>
>> It's not that simple, especially when it comes to composite encodings.
>> E.g. for some coders, nested(C) == unnested(C), for some coders
>> nested(C) == lenth_prefix(unnested(C)), and for other coders it's
>> something else altogether (e.g. when creating a kv coder, the first
>> component must use nested context, and the second inherits the nested
>> vs. unnested context). When creating TupleCoder(A, B) one doesn't want
>> to forcibly use LenthPrefixCoder(A) and LengthPrefixCoder(B), nor does
>> one want to force LengthPrefixCoder(TupleCoder(A, B)) because A and B
>> may themselves be large and incrementally written (e.g.
>> IterableCoder). Using distinct URNs doesn't work well if the runner is
>> free to compose and decompose tuple, iterable, etc. coders that it
>> doesn't understand.
>>
>> Until we stop using Coders for IO (a worthy but probably lofty goal)
>> we will continue to need the unnested context (lest we expect and
>> produce length-prefixed coders in text files, as bigtable keys, etc.).
>> On the other hand, almost all internal use is nested (due to sending
>> elements around as part of element streams). The other place we cross
>> over is LengthPrefixCoder that encodes its values using the unnested
>> context prefixed by the unnested encoding length.
>>
>> Perhaps a step in the right direction would be to consistently use the
>> unnested context everywhere but IOs (meaning when we talked about
>> coders from the FnAPI perspective, they're *always* in the nested
>> context, and hence always have the one and only encoding defined by
>> that URN, including when wrapped by a length prefix coder (which would
>> sometimes result in double length prefixing, but I think that's a
>> price worth paying (or we could do something more clever like make
>> length-prefix an (explicit) modifier on a coder rather than a new
>> coder itself that would default to length prefixing (or some of the
>> other delimiting schemes we've come up with) but allow the component
>> coder to offer alternative length-prefix-compatible encodings))). IOs
>> could be updated to take Parsers and Formatters (or whatever we call
>> them) with the Coders in the constructors left as syntactic sugar
>> until we could remove them in 3.0. As Luke says, we have a chance to
>> fix our coders for portable pipelines now.
>>
>> In the (very) short term, we're stuck with a nested and unnested
>> version of StringUtf8, just as we have for bytes, lest we change the
>> meaning of (or disallow some of) TupleCoder[StrUtf8Coder, ...],
>> LengthPrefixCoder[StrUtf8Coder], and using StringUtf8Coder for IO.
>
>
> First, let's note that "nested" and "outer" are a misnomer. The distinction is whether it is the last thing encoded in the stream. In a KvCoder<KeyCoder, ValueCoder> the ValueCoder is actually encoded in the "outer" context though the value is nested. No doubt a good amount of confusion comes from the initial and continued use of this terminology.

+1. I think of these as "self-delimiting" vs. "externally-delimited."

> So, all that said, it is a simple fact that UTF-8 and length-prefixed UTF-8 are two different encodings. Encodings are the fundamental concept here and coders encapsulate two encodings, with some subtle and inconsistently-applied rules about when to use which encoding. I think we should still give them distinct URNs unless impossible. You've outlined some steps to clarify the situation.

Currently, we have a one-to-two relationship between Coders and
encodings, and a one-to-one relationship between URNs and encoders.

To make the first one-to-one, we would either have to make
StringUtf8Coder unsuitable for TextIO (letting it always prefix its
contents with a length) or unsuitable for the key part of a KV, the
element of an iterable, etc. (where the length is required).
Alternatively we could give Coders the ability to return the
nested/unnested version of themselves, but this also gets messy
because it depends on the ultimate outer context which we don't always
have at hand (and leads to surprises, e.g. asking for the key coder of
a KV coder may not return the same coder that was used to construct
it).

On the other hand, breaking the one-to-one relationship of Coders and
URNs is also undesirable, because it forces a choice when serializing
to protos (where one may not always know all the contexts it's used
in) and makes the round-trip through protos non-idempotent (if the URN
gets decoded back into a Coder that does not have this dual-defined
encoding). Also, if a runner knows about special pairs of URNs that
represent the nested/unnested encodings of the same type, and rules
like knowing constraints on components of KVs, etc., this seems an
even worse situation than we're already in.

> I think the most meaningful issue is runners manipulating coders. But the runner should be able to instruct the SDK (where the coder actually executes) about which encoding to use. I need to think through some examples of where the SDK tells the runner the encoding of something and those where the runner fabricates steps and instructs the SDK what encoding to use for elements. GroupByKey, GetKeys, etc, are examples where the context will change.

Combiner lifting and side inputs are good candidates as well.

>> > On Fri, Apr 5, 2019 at 12:38 AM Robert Bradshaw <ro...@google.com> wrote:
>> >>
>> >> On Fri, Apr 5, 2019 at 12:50 AM Heejong Lee <he...@google.com> wrote:
>> >> >
>> >> > Robert, does nested/unnested context work properly for Java?
>> >>
>> >> I believe so. It is similar to the bytes coder, that prefixes vs. not
>> >> based on the context.
>> >>
>> >> > I can see that the Context is fixed to NESTED[1] and the encode method with the Context parameter is marked as deprecated[2].
>> >> >
>> >> > [1]: https://github.com/apache/beam/blob/0868e7544fd1e96db67ff5b9e70a67802c0f0c8e/sdks/java/core/src/main/java/org/apache/beam/sdk/coders/StringUtf8Coder.java#L68
>> >> > [2]: https://github.com/apache/beam/blob/0868e7544fd1e96db67ff5b9e70a67802c0f0c8e/sdks/java/core/src/main/java/org/apache/beam/sdk/coders/Coder.java#L132
>> >>
>> >> That doesn't mean it's unused, e.g.
>> >>
>> >> https://github.com/apache/beam/blob/release-2.12.0/sdks/java/core/src/main/java/org/apache/beam/sdk/util/CoderUtils.java#L160
>> >> https://github.com/apache/beam/blob/release-2.12.0/sdks/java/core/src/main/java/org/apache/beam/sdk/coders/LengthPrefixCoder.java#L64
>> >>
>> >> (and I'm sure there's others).
>> >>
>> >> > On Thu, Apr 4, 2019 at 3:25 PM Robert Bradshaw <ro...@google.com> wrote:
>> >> >>
>> >> >> I don't know why there are two separate copies of
>> >> >> standard_coders.yaml--originally there was just one (though it did
>> >> >> live in the Python directory). I'm guessing a copy was made rather
>> >> >> than just pointing both to the new location, but that completely
>> >> >> defeats the point. I can't seem to access JIRA right now; could
>> >> >> someone file an issue to resolve this?
>> >> >>
>> >> >> I also think the spec should be next to the definition of the URN,
>> >> >> that's one of the reason the URNs were originally in a markdown file
>> >> >> (to encourage good documentation, literate programming style). Many
>> >> >> coders already have their specs there.
>> >> >>
>> >> >> Regarding backwards compatibility, we can't change existing coders,
>> >> >> and making new coders won't help with inference ('cause changing that
>> >> >> would also be backwards incompatible). Fortunately, I think we're
>> >> >> already doing the consistent thing here: In both Python and Java the
>> >> >> raw UTF-8 encoded bytes are encoded when used in an *unnested* context
>> >> >> and the length-prefixed UTF-8 encoded bytes are used when the coder is
>> >> >> used in a *nested* context.
>> >> >>
>> >> >> I'd really like to see the whole nested/unnested context go away, but
>> >> >> that'll probably require Beam 3.0; it causes way more confusion than
>> >> >> the couple of bytes it saves in a couple of places.
>> >> >>
>> >> >> - Robert
>> >> >>
>> >> >> On Thu, Apr 4, 2019 at 10:55 PM Robert Burke <ro...@frantil.com> wrote:
>> >> >> >
>> >> >> > My 2cents is that the "Textual description" should be part of the documentation of the URNs on the Proto messages, since that's the common place. I've added a short description for the varints for example, and we already have lenghthier format & protocol descriptions there for iterables and similar.
>> >> >> >
>> >> >> > The proto [1] *can be* the spec if we want it to be.
>> >> >> >
>> >> >> > [1]: https://github.com/apache/beam/blob/069fc3de95bd96f34c363308ad9ba988ab58502d/model/pipeline/src/main/proto/beam_runner_api.proto#L557
>> >> >> >
>> >> >> > On Thu, 4 Apr 2019 at 13:51, Kenneth Knowles <ke...@apache.org> wrote:
>> >> >> >>
>> >> >> >>
>> >> >> >>
>> >> >> >> On Thu, Apr 4, 2019 at 1:49 PM Robert Burke <ro...@frantil.com> wrote:
>> >> >> >>>
>> >> >> >>> We should probably move the "java" version of the yaml file [1] to a common location rather than deep in the java hierarchy, or copying it for Go and Python, but that can be a separate task. It's probably non-trivial since it looks like it's part of a java resources structure.
>> >> >> >>
>> >> >> >>
>> >> >> >> Seems like /model is a good place for this if we don't want to invent a new language-independent hierarchy.
>> >> >> >>
>> >> >> >> Kenn
>> >> >> >>
>> >> >> >>
>> >> >> >>>
>> >> >> >>> Luke, the Go SDK doesn't currently do this validation, but it shouldn't be difficult, given pointers to the Java and Python variants of the tests to crib from [2]. Care would need to be taken so that Beam Go SDK users (such as they are) aren't forced to run them, and not have the yaml file to read. I'd suggest putting it with the integration tests [3].
>> >> >> >>>
>> >> >> >>> I've filed a JIRA (BEAM-7009) for tracking this Go SDK side work. [4]
>> >> >> >>>
>> >> >> >>> 1: https://github.com/apache/beam/blob/master/model/fn-execution/src/main/resources/org/apache/beam/model/fnexecution/v1/standard_coders.yaml
>> >> >> >>> 2: https://github.com/apache/beam/search?q=standard_coders.yaml&unscoped_q=standard_coders.yaml
>> >> >> >>> 3: https://github.com/apache/beam/tree/master/sdks/go/test
>> >> >> >>> 4: https://issues.apache.org/jira/browse/BEAM-7009
>> >> >> >>>
>> >> >> >>>
>> >> >> >>> On Thu, 4 Apr 2019 at 13:28, Lukasz Cwik <lc...@google.com> wrote:
>> >> >> >>>>
>> >> >> >>>>
>> >> >> >>>>
>> >> >> >>>> On Thu, Apr 4, 2019 at 1:15 PM Chamikara Jayalath <ch...@google.com> wrote:
>> >> >> >>>>>
>> >> >> >>>>>
>> >> >> >>>>>
>> >> >> >>>>> On Thu, Apr 4, 2019 at 12:15 PM Lukasz Cwik <lc...@google.com> wrote:
>> >> >> >>>>>>
>> >> >> >>>>>> standard_coders.yaml[1] is where we are currently defining these formats.
>> >> >> >>>>>> Unfortunately the Python SDK has its own copy[2].
>> >> >> >>>>>
>> >> >> >>>>>
>> >> >> >>>>> Ah great. Thanks for the pointer. Any idea why there's  a separate copy for Python ? I didn't see a significant difference in definitions looking at few random coders there but I might have missed something. If there's no reason to maintain two, we should probably unify.
>> >> >> >>>>> Also, seems like we haven't added the definition for UTF-8 coder yet.
>> >> >> >>>>>
>> >> >> >>>>
>> >> >> >>>> Not certain as well. I did notice the timer coder definition didn't exist in the Python copy.
>> >> >> >>>>
>> >> >> >>>>>>
>> >> >> >>>>>>
>> >> >> >>>>>> Here is an example PR[3] that adds the "beam:coder:double:v1" as tests to the Java and Python SDKs to ensure interoperability.
>> >> >> >>>>>>
>> >> >> >>>>>> Robert Burke, does the Go SDK have a test where it uses standard_coders.yaml and runs compatibility tests?
>> >> >> >>>>>>
>> >> >> >>>>>> Chamikara, creating new coder classes is a pain since the type -> coder mapping per SDK language would select the non-well known type if we added a new one to a language. If we swapped the default type->coder mapping, this would still break update for pipelines forcing users to update their code to select the non-well known type. If we don't change the default type->coder mapping, the well known coder will gain little usage. I think we should fix the Python coder to use the same encoding as Java for UTF-8 strings before there are too many Python SDK users.
>> >> >> >>>>>
>> >> >> >>>>>
>> >> >> >>>>> I was thinking that may be we should just change the default UTF-8 coder for Fn API path which is experimental. Updating Python to do what's done for Java is fine if we agree that encoding used for Java should be the standard.
>> >> >> >>>>>
>> >> >> >>>>
>> >> >> >>>> That is a good idea to use the Fn API experiment to control which gets selected.
>> >> >> >>>>
>> >> >> >>>>>>
>> >> >> >>>>>>
>> >> >> >>>>>> 1: https://github.com/apache/beam/blob/master/model/fn-execution/src/main/resources/org/apache/beam/model/fnexecution/v1/standard_coders.yaml
>> >> >> >>>>>> 2: https://github.com/apache/beam/blob/master/sdks/python/apache_beam/testing/data/standard_coders.yaml
>> >> >> >>>>>> 3: https://github.com/apache/beam/pull/8205
>> >> >> >>>>>>
>> >> >> >>>>>> On Thu, Apr 4, 2019 at 11:50 AM Chamikara Jayalath <ch...@google.com> wrote:
>> >> >> >>>>>>>
>> >> >> >>>>>>>
>> >> >> >>>>>>>
>> >> >> >>>>>>> On Thu, Apr 4, 2019 at 11:29 AM Robert Bradshaw <ro...@google.com> wrote:
>> >> >> >>>>>>>>
>> >> >> >>>>>>>> A URN defines the encoding.
>> >> >> >>>>>>>>
>> >> >> >>>>>>>> There are (unfortunately) *two* encodings defined for a Coder (defined
>> >> >> >>>>>>>> by a URN), the nested and the unnested one. IIRC, in both Java and
>> >> >> >>>>>>>> Python, the nested one prefixes with a var-int length, and the
>> >> >> >>>>>>>> unnested one does not.
>> >> >> >>>>>>>
>> >> >> >>>>>>>
>> >> >> >>>>>>> Could you clarify where we define the exact encoding ? I only see a URN for UTF-8 [1] while if you look at the implementations Java includes length in the encoding [1] while Python [1] does not.
>> >> >> >>>>>>>
>> >> >> >>>>>>> [1] https://github.com/apache/beam/blob/069fc3de95bd96f34c363308ad9ba988ab58502d/model/pipeline/src/main/proto/beam_runner_api.proto#L563
>> >> >> >>>>>>> [2] https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/coders/StringUtf8Coder.java#L50
>> >> >> >>>>>>> [3] https://github.com/apache/beam/blob/master/sdks/python/apache_beam/coders/coders.py#L321
>> >> >> >>>>>>>
>> >> >> >>>>>>>
>> >> >> >>>>>>>>
>> >> >> >>>>>>>>
>> >> >> >>>>>>>> We should define the spec clearly and have cross-language tests.
>> >> >> >>>>>>>
>> >> >> >>>>>>>
>> >> >> >>>>>>> +1
>> >> >> >>>>>>>
>> >> >> >>>>>>> Regarding backwards compatibility, I agree that we should probably not update existing coder classes. Probably we should just standardize the correct encoding (may be as a comment near corresponding URN in the beam_runner_api.proto ?) and create new coder classes as needed.
>> >> >> >>>>>>>
>> >> >> >>>>>>>>
>> >> >> >>>>>>>>
>> >> >> >>>>>>>> On Thu, Apr 4, 2019 at 8:13 PM Pablo Estrada <pa...@google.com> wrote:
>> >> >> >>>>>>>> >
>> >> >> >>>>>>>> > Could this be a backwards-incompatible change that would break pipelines from upgrading? If they have data in-flight in between operators, and we change the coder, they would break?
>> >> >> >>>>>>>> > I know very little about coders, but since nobody has mentioned it, I wanted to make sure we have it in mind.
>> >> >> >>>>>>>> > -P.
>> >> >> >>>>>>>> >
>> >> >> >>>>>>>> > On Wed, Apr 3, 2019 at 8:33 PM Kenneth Knowles <ke...@apache.org> wrote:
>> >> >> >>>>>>>> >>
>> >> >> >>>>>>>> >> Agree that a coder URN defines the encoding. I see that string UTF-8 was added to the proto enum, but it needs a written spec of the encoding. Ideally some test data that different languages can use to drive compliance testing.
>> >> >> >>>>>>>> >>
>> >> >> >>>>>>>> >> Kenn
>> >> >> >>>>>>>> >>
>> >> >> >>>>>>>> >> On Wed, Apr 3, 2019 at 6:21 PM Robert Burke <ro...@frantil.com> wrote:
>> >> >> >>>>>>>> >>>
>> >> >> >>>>>>>> >>> String UTF8 was recently added as a "standard coder " URN in the protos, but I don't think that developed beyond Java, so adding it to Python would be reasonable in my opinion.
>> >> >> >>>>>>>> >>>
>> >> >> >>>>>>>> >>> The Go SDK handles Strings as "custom coders" presently which for Go are always length prefixed (and reported to the Runner as LP+CustomCoder). It would be straight forward to add the correct handling for strings, as Go natively treats strings as UTF8.
>> >> >> >>>>>>>> >>>
>> >> >> >>>>>>>> >>>
>> >> >> >>>>>>>> >>> On Wed, Apr 3, 2019, 5:03 PM Heejong Lee <he...@google.com> wrote:
>> >> >> >>>>>>>> >>>>
>> >> >> >>>>>>>> >>>> Hi all,
>> >> >> >>>>>>>> >>>>
>> >> >> >>>>>>>> >>>> It looks like UTF-8 String Coder in Java and Python SDKs uses different encoding schemes. StringUtf8Coder in Java SDK puts the varint length of the input string before actual data bytes however StrUtf8Coder in Python SDK directly encodes the input string to bytes value. For the last few weeks, I've been testing and fixing cross-language IO transforms and this discrepancy is a major blocker for me. IMO, we should unify the encoding schemes of UTF8 strings across the different SDKs and make it a standard coder. Any thoughts?
>> >> >> >>>>>>>> >>>>
>> >> >> >>>>>>>> >>>> Thanks,

Re: [DISCUSS] change the encoding scheme of Python StrUtf8Coder

Posted by Kenneth Knowles <ke...@apache.org>.

On Fri, Apr 5, 2019 at 2:24 PM Robert Bradshaw <ro...@google.com> wrote:

> On Fri, Apr 5, 2019 at 6:24 PM Kenneth Knowles <ke...@apache.org> wrote:
> >
> > Nested and unnested contexts are two different encodings. Can we just
> give them different URNs? We can even just express the length-prefixed
> UTF-8 as a composition of the length-prefix URN and the UTF-8 URN.
>
> It's not that simple, especially when it comes to composite encodings.
> E.g. for some coders, nested(C) == unnested(C), for some coders
> nested(C) == lenth_prefix(unnested(C)), and for other coders it's
> something else altogether (e.g. when creating a kv coder, the first
> component must use nested context, and the second inherits the nested
> vs. unnested context). When creating TupleCoder(A, B) one doesn't want
> to forcibly use LenthPrefixCoder(A) and LengthPrefixCoder(B), nor does
> one want to force LengthPrefixCoder(TupleCoder(A, B)) because A and B
> may themselves be large and incrementally written (e.g.
> IterableCoder). Using distinct URNs doesn't work well if the runner is
> free to compose and decompose tuple, iterable, etc. coders that it
> doesn't understand.
>
> Until we stop using Coders for IO (a worthy but probably lofty goal)
> we will continue to need the unnested context (lest we expect and
> produce length-prefixed coders in text files, as bigtable keys, etc.).
> On the other hand, almost all internal use is nested (due to sending
> elements around as part of element streams). The other place we cross
> over is LengthPrefixCoder that encodes its values using the unnested
> context prefixed by the unnested encoding length.
>
> Perhaps a step in the right direction would be to consistently use the
> unnested context everywhere but IOs (meaning when we talked about
> coders from the FnAPI perspective, they're *always* in the nested
> context, and hence always have the one and only encoding defined by
> that URN, including when wrapped by a length prefix coder (which would
> sometimes result in double length prefixing, but I think that's a
> price worth paying (or we could do something more clever like make
> length-prefix an (explicit) modifier on a coder rather than a new
> coder itself that would default to length prefixing (or some of the
> other delimiting schemes we've come up with) but allow the component
> coder to offer alternative length-prefix-compatible encodings))). IOs
> could be updated to take Parsers and Formatters (or whatever we call
> them) with the Coders in the constructors left as syntactic sugar
> until we could remove them in 3.0. As Luke says, we have a chance to
> fix our coders for portable pipelines now.
>
> In the (very) short term, we're stuck with a nested and unnested
> version of StringUtf8, just as we have for bytes, lest we change the
> meaning of (or disallow some of) TupleCoder[StrUtf8Coder, ...],
> LengthPrefixCoder[StrUtf8Coder], and using StringUtf8Coder for IO.
>

First, let's note that "nested" and "outer" are a misnomer. The distinction
is whether it is the last thing encoded in the stream. In a
KvCoder<KeyCoder, ValueCoder> the ValueCoder is actually encoded in the
"outer" context though the value is nested. No doubt a good amount of
confusion comes from the initial and continued use of this terminology.

So, all that said, it is a simple fact that UTF-8 and length-prefixed UTF-8
are two different encodings. Encodings are the fundamental concept here and
coders encapsulate two encodings, with some subtle and
inconsistently-applied rules about when to use which encoding. I think we
should still give them distinct URNs unless impossible. You've outlined
some steps to clarify the situation.

I think the most meaningful issue is runners manipulating coders. But the
runner should be able to instruct the SDK (where the coder actually
executes) about which encoding to use. I need to think through some
examples of where the SDK tells the runner the encoding of something and
those where the runner fabricates steps and instructs the SDK what encoding
to use for elements. GroupByKey, GetKeys, etc, are examples where the
context will change.

Kenn


>
>
> >
> > On Fri, Apr 5, 2019 at 12:38 AM Robert Bradshaw <ro...@google.com>
> wrote:
> >>
> >> On Fri, Apr 5, 2019 at 12:50 AM Heejong Lee <he...@google.com> wrote:
> >> >
> >> > Robert, does nested/unnested context work properly for Java?
> >>
> >> I believe so. It is similar to the bytes coder, that prefixes vs. not
> >> based on the context.
> >>
> >> > I can see that the Context is fixed to NESTED[1] and the encode
> method with the Context parameter is marked as deprecated[2].
> >> >
> >> > [1]:
> https://github.com/apache/beam/blob/0868e7544fd1e96db67ff5b9e70a67802c0f0c8e/sdks/java/core/src/main/java/org/apache/beam/sdk/coders/StringUtf8Coder.java#L68
> >> > [2]:
> https://github.com/apache/beam/blob/0868e7544fd1e96db67ff5b9e70a67802c0f0c8e/sdks/java/core/src/main/java/org/apache/beam/sdk/coders/Coder.java#L132
> >>
> >> That doesn't mean it's unused, e.g.
> >>
> >>
> https://github.com/apache/beam/blob/release-2.12.0/sdks/java/core/src/main/java/org/apache/beam/sdk/util/CoderUtils.java#L160
> >>
> https://github.com/apache/beam/blob/release-2.12.0/sdks/java/core/src/main/java/org/apache/beam/sdk/coders/LengthPrefixCoder.java#L64
> >>
> >> (and I'm sure there's others).
> >>
> >> > On Thu, Apr 4, 2019 at 3:25 PM Robert Bradshaw <ro...@google.com>
> wrote:
> >> >>
> >> >> I don't know why there are two separate copies of
> >> >> standard_coders.yaml--originally there was just one (though it did
> >> >> live in the Python directory). I'm guessing a copy was made rather
> >> >> than just pointing both to the new location, but that completely
> >> >> defeats the point. I can't seem to access JIRA right now; could
> >> >> someone file an issue to resolve this?
> >> >>
> >> >> I also think the spec should be next to the definition of the URN,
> >> >> that's one of the reason the URNs were originally in a markdown file
> >> >> (to encourage good documentation, literate programming style). Many
> >> >> coders already have their specs there.
> >> >>
> >> >> Regarding backwards compatibility, we can't change existing coders,
> >> >> and making new coders won't help with inference ('cause changing that
> >> >> would also be backwards incompatible). Fortunately, I think we're
> >> >> already doing the consistent thing here: In both Python and Java the
> >> >> raw UTF-8 encoded bytes are encoded when used in an *unnested*
> context
> >> >> and the length-prefixed UTF-8 encoded bytes are used when the coder
> is
> >> >> used in a *nested* context.
> >> >>
> >> >> I'd really like to see the whole nested/unnested context go away, but
> >> >> that'll probably require Beam 3.0; it causes way more confusion than
> >> >> the couple of bytes it saves in a couple of places.
> >> >>
> >> >> - Robert
> >> >>
> >> >> On Thu, Apr 4, 2019 at 10:55 PM Robert Burke <ro...@frantil.com>
> wrote:
> >> >> >
> >> >> > My 2cents is that the "Textual description" should be part of the
> documentation of the URNs on the Proto messages, since that's the common
> place. I've added a short description for the varints for example, and we
> already have lenghthier format & protocol descriptions there for iterables
> and similar.
> >> >> >
> >> >> > The proto [1] *can be* the spec if we want it to be.
> >> >> >
> >> >> > [1]:
> https://github.com/apache/beam/blob/069fc3de95bd96f34c363308ad9ba988ab58502d/model/pipeline/src/main/proto/beam_runner_api.proto#L557
> >> >> >
> >> >> > On Thu, 4 Apr 2019 at 13:51, Kenneth Knowles <ke...@apache.org>
> wrote:
> >> >> >>
> >> >> >>
> >> >> >>
> >> >> >> On Thu, Apr 4, 2019 at 1:49 PM Robert Burke <ro...@frantil.com>
> wrote:
> >> >> >>>
> >> >> >>> We should probably move the "java" version of the yaml file [1]
> to a common location rather than deep in the java hierarchy, or copying it
> for Go and Python, but that can be a separate task. It's probably
> non-trivial since it looks like it's part of a java resources structure.
> >> >> >>
> >> >> >>
> >> >> >> Seems like /model is a good place for this if we don't want to
> invent a new language-independent hierarchy.
> >> >> >>
> >> >> >> Kenn
> >> >> >>
> >> >> >>
> >> >> >>>
> >> >> >>> Luke, the Go SDK doesn't currently do this validation, but it
> shouldn't be difficult, given pointers to the Java and Python variants of
> the tests to crib from [2]. Care would need to be taken so that Beam Go SDK
> users (such as they are) aren't forced to run them, and not have the yaml
> file to read. I'd suggest putting it with the integration tests [3].
> >> >> >>>
> >> >> >>> I've filed a JIRA (BEAM-7009) for tracking this Go SDK side
> work. [4]
> >> >> >>>
> >> >> >>> 1:
> https://github.com/apache/beam/blob/master/model/fn-execution/src/main/resources/org/apache/beam/model/fnexecution/v1/standard_coders.yaml
> >> >> >>> 2:
> https://github.com/apache/beam/search?q=standard_coders.yaml&unscoped_q=standard_coders.yaml
> >> >> >>> 3: https://github.com/apache/beam/tree/master/sdks/go/test
> >> >> >>> 4: https://issues.apache.org/jira/browse/BEAM-7009
> >> >> >>>
> >> >> >>>
> >> >> >>> On Thu, 4 Apr 2019 at 13:28, Lukasz Cwik <lc...@google.com>
> wrote:
> >> >> >>>>
> >> >> >>>>
> >> >> >>>>
> >> >> >>>> On Thu, Apr 4, 2019 at 1:15 PM Chamikara Jayalath <
> chamikara@google.com> wrote:
> >> >> >>>>>
> >> >> >>>>>
> >> >> >>>>>
> >> >> >>>>> On Thu, Apr 4, 2019 at 12:15 PM Lukasz Cwik <lc...@google.com>
> wrote:
> >> >> >>>>>>
> >> >> >>>>>> standard_coders.yaml[1] is where we are currently defining
> these formats.
> >> >> >>>>>> Unfortunately the Python SDK has its own copy[2].
> >> >> >>>>>
> >> >> >>>>>
> >> >> >>>>> Ah great. Thanks for the pointer. Any idea why there's  a
> separate copy for Python ? I didn't see a significant difference in
> definitions looking at few random coders there but I might have missed
> something. If there's no reason to maintain two, we should probably unify.
> >> >> >>>>> Also, seems like we haven't added the definition for UTF-8
> coder yet.
> >> >> >>>>>
> >> >> >>>>
> >> >> >>>> Not certain as well. I did notice the timer coder definition
> didn't exist in the Python copy.
> >> >> >>>>
> >> >> >>>>>>
> >> >> >>>>>>
> >> >> >>>>>> Here is an example PR[3] that adds the "beam:coder:double:v1"
> as tests to the Java and Python SDKs to ensure interoperability.
> >> >> >>>>>>
> >> >> >>>>>> Robert Burke, does the Go SDK have a test where it uses
> standard_coders.yaml and runs compatibility tests?
> >> >> >>>>>>
> >> >> >>>>>> Chamikara, creating new coder classes is a pain since the
> type -> coder mapping per SDK language would select the non-well known type
> if we added a new one to a language. If we swapped the default type->coder
> mapping, this would still break update for pipelines forcing users to
> update their code to select the non-well known type. If we don't change the
> default type->coder mapping, the well known coder will gain little usage. I
> think we should fix the Python coder to use the same encoding as Java for
> UTF-8 strings before there are too many Python SDK users.
> >> >> >>>>>
> >> >> >>>>>
> >> >> >>>>> I was thinking that may be we should just change the default
> UTF-8 coder for Fn API path which is experimental. Updating Python to do
> what's done for Java is fine if we agree that encoding used for Java should
> be the standard.
> >> >> >>>>>
> >> >> >>>>
> >> >> >>>> That is a good idea to use the Fn API experiment to control
> which gets selected.
> >> >> >>>>
> >> >> >>>>>>
> >> >> >>>>>>
> >> >> >>>>>> 1:
> https://github.com/apache/beam/blob/master/model/fn-execution/src/main/resources/org/apache/beam/model/fnexecution/v1/standard_coders.yaml
> >> >> >>>>>> 2:
> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/testing/data/standard_coders.yaml
> >> >> >>>>>> 3: https://github.com/apache/beam/pull/8205
> >> >> >>>>>>
> >> >> >>>>>> On Thu, Apr 4, 2019 at 11:50 AM Chamikara Jayalath <
> chamikara@google.com> wrote:
> >> >> >>>>>>>
> >> >> >>>>>>>
> >> >> >>>>>>>
> >> >> >>>>>>> On Thu, Apr 4, 2019 at 11:29 AM Robert Bradshaw <
> robertwb@google.com> wrote:
> >> >> >>>>>>>>
> >> >> >>>>>>>> A URN defines the encoding.
> >> >> >>>>>>>>
> >> >> >>>>>>>> There are (unfortunately) *two* encodings defined for a
> Coder (defined
> >> >> >>>>>>>> by a URN), the nested and the unnested one. IIRC, in both
> Java and
> >> >> >>>>>>>> Python, the nested one prefixes with a var-int length, and
> the
> >> >> >>>>>>>> unnested one does not.
> >> >> >>>>>>>
> >> >> >>>>>>>
> >> >> >>>>>>> Could you clarify where we define the exact encoding ? I
> only see a URN for UTF-8 [1] while if you look at the implementations Java
> includes length in the encoding [1] while Python [1] does not.
> >> >> >>>>>>>
> >> >> >>>>>>> [1]
> https://github.com/apache/beam/blob/069fc3de95bd96f34c363308ad9ba988ab58502d/model/pipeline/src/main/proto/beam_runner_api.proto#L563
> >> >> >>>>>>> [2]
> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/coders/StringUtf8Coder.java#L50
> >> >> >>>>>>> [3]
> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/coders/coders.py#L321
> >> >> >>>>>>>
> >> >> >>>>>>>
> >> >> >>>>>>>>
> >> >> >>>>>>>>
> >> >> >>>>>>>> We should define the spec clearly and have cross-language
> tests.
> >> >> >>>>>>>
> >> >> >>>>>>>
> >> >> >>>>>>> +1
> >> >> >>>>>>>
> >> >> >>>>>>> Regarding backwards compatibility, I agree that we should
> probably not update existing coder classes. Probably we should just
> standardize the correct encoding (may be as a comment near corresponding
> URN in the beam_runner_api.proto ?) and create new coder classes as needed.
> >> >> >>>>>>>
> >> >> >>>>>>>>
> >> >> >>>>>>>>
> >> >> >>>>>>>> On Thu, Apr 4, 2019 at 8:13 PM Pablo Estrada <
> pabloem@google.com> wrote:
> >> >> >>>>>>>> >
> >> >> >>>>>>>> > Could this be a backwards-incompatible change that would
> break pipelines from upgrading? If they have data in-flight in between
> operators, and we change the coder, they would break?
> >> >> >>>>>>>> > I know very little about coders, but since nobody has
> mentioned it, I wanted to make sure we have it in mind.
> >> >> >>>>>>>> > -P.
> >> >> >>>>>>>> >
> >> >> >>>>>>>> > On Wed, Apr 3, 2019 at 8:33 PM Kenneth Knowles <
> kenn@apache.org> wrote:
> >> >> >>>>>>>> >>
> >> >> >>>>>>>> >> Agree that a coder URN defines the encoding. I see that
> string UTF-8 was added to the proto enum, but it needs a written spec of
> the encoding. Ideally some test data that different languages can use to
> drive compliance testing.
> >> >> >>>>>>>> >>
> >> >> >>>>>>>> >> Kenn
> >> >> >>>>>>>> >>
> >> >> >>>>>>>> >> On Wed, Apr 3, 2019 at 6:21 PM Robert Burke <
> robert@frantil.com> wrote:
> >> >> >>>>>>>> >>>
> >> >> >>>>>>>> >>> String UTF8 was recently added as a "standard coder "
> URN in the protos, but I don't think that developed beyond Java, so adding
> it to Python would be reasonable in my opinion.
> >> >> >>>>>>>> >>>
> >> >> >>>>>>>> >>> The Go SDK handles Strings as "custom coders" presently
> which for Go are always length prefixed (and reported to the Runner as
> LP+CustomCoder). It would be straight forward to add the correct handling
> for strings, as Go natively treats strings as UTF8.
> >> >> >>>>>>>> >>>
> >> >> >>>>>>>> >>>
> >> >> >>>>>>>> >>> On Wed, Apr 3, 2019, 5:03 PM Heejong Lee <
> heejong@google.com> wrote:
> >> >> >>>>>>>> >>>>
> >> >> >>>>>>>> >>>> Hi all,
> >> >> >>>>>>>> >>>>
> >> >> >>>>>>>> >>>> It looks like UTF-8 String Coder in Java and Python
> SDKs uses different encoding schemes. StringUtf8Coder in Java SDK puts the
> varint length of the input string before actual data bytes however
> StrUtf8Coder in Python SDK directly encodes the input string to bytes
> value. For the last few weeks, I've been testing and fixing cross-language
> IO transforms and this discrepancy is a major blocker for me. IMO, we
> should unify the encoding schemes of UTF8 strings across the different SDKs
> and make it a standard coder. Any thoughts?
> >> >> >>>>>>>> >>>>
> >> >> >>>>>>>> >>>> Thanks,
>

Re: [DISCUSS] change the encoding scheme of Python StrUtf8Coder

Posted by Robert Bradshaw <ro...@google.com>.

On Fri, Apr 5, 2019 at 6:24 PM Kenneth Knowles <ke...@apache.org> wrote:
>
> Nested and unnested contexts are two different encodings. Can we just give them different URNs? We can even just express the length-prefixed UTF-8 as a composition of the length-prefix URN and the UTF-8 URN.

It's not that simple, especially when it comes to composite encodings.
E.g. for some coders, nested(C) == unnested(C), for some coders
nested(C) == lenth_prefix(unnested(C)), and for other coders it's
something else altogether (e.g. when creating a kv coder, the first
component must use nested context, and the second inherits the nested
vs. unnested context). When creating TupleCoder(A, B) one doesn't want
to forcibly use LenthPrefixCoder(A) and LengthPrefixCoder(B), nor does
one want to force LengthPrefixCoder(TupleCoder(A, B)) because A and B
may themselves be large and incrementally written (e.g.
IterableCoder). Using distinct URNs doesn't work well if the runner is
free to compose and decompose tuple, iterable, etc. coders that it
doesn't understand.

Until we stop using Coders for IO (a worthy but probably lofty goal)
we will continue to need the unnested context (lest we expect and
produce length-prefixed coders in text files, as bigtable keys, etc.).
On the other hand, almost all internal use is nested (due to sending
elements around as part of element streams). The other place we cross
over is LengthPrefixCoder that encodes its values using the unnested
context prefixed by the unnested encoding length.

Perhaps a step in the right direction would be to consistently use the
unnested context everywhere but IOs (meaning when we talked about
coders from the FnAPI perspective, they're *always* in the nested
context, and hence always have the one and only encoding defined by
that URN, including when wrapped by a length prefix coder (which would
sometimes result in double length prefixing, but I think that's a
price worth paying (or we could do something more clever like make
length-prefix an (explicit) modifier on a coder rather than a new
coder itself that would default to length prefixing (or some of the
other delimiting schemes we've come up with) but allow the component
coder to offer alternative length-prefix-compatible encodings))). IOs
could be updated to take Parsers and Formatters (or whatever we call
them) with the Coders in the constructors left as syntactic sugar
until we could remove them in 3.0. As Luke says, we have a chance to
fix our coders for portable pipelines now.

In the (very) short term, we're stuck with a nested and unnested
version of StringUtf8, just as we have for bytes, lest we change the
meaning of (or disallow some of) TupleCoder[StrUtf8Coder, ...],
LengthPrefixCoder[StrUtf8Coder], and using StringUtf8Coder for IO.


>
> On Fri, Apr 5, 2019 at 12:38 AM Robert Bradshaw <ro...@google.com> wrote:
>>
>> On Fri, Apr 5, 2019 at 12:50 AM Heejong Lee <he...@google.com> wrote:
>> >
>> > Robert, does nested/unnested context work properly for Java?
>>
>> I believe so. It is similar to the bytes coder, that prefixes vs. not
>> based on the context.
>>
>> > I can see that the Context is fixed to NESTED[1] and the encode method with the Context parameter is marked as deprecated[2].
>> >
>> > [1]: https://github.com/apache/beam/blob/0868e7544fd1e96db67ff5b9e70a67802c0f0c8e/sdks/java/core/src/main/java/org/apache/beam/sdk/coders/StringUtf8Coder.java#L68
>> > [2]: https://github.com/apache/beam/blob/0868e7544fd1e96db67ff5b9e70a67802c0f0c8e/sdks/java/core/src/main/java/org/apache/beam/sdk/coders/Coder.java#L132
>>
>> That doesn't mean it's unused, e.g.
>>
>> https://github.com/apache/beam/blob/release-2.12.0/sdks/java/core/src/main/java/org/apache/beam/sdk/util/CoderUtils.java#L160
>> https://github.com/apache/beam/blob/release-2.12.0/sdks/java/core/src/main/java/org/apache/beam/sdk/coders/LengthPrefixCoder.java#L64
>>
>> (and I'm sure there's others).
>>
>> > On Thu, Apr 4, 2019 at 3:25 PM Robert Bradshaw <ro...@google.com> wrote:
>> >>
>> >> I don't know why there are two separate copies of
>> >> standard_coders.yaml--originally there was just one (though it did
>> >> live in the Python directory). I'm guessing a copy was made rather
>> >> than just pointing both to the new location, but that completely
>> >> defeats the point. I can't seem to access JIRA right now; could
>> >> someone file an issue to resolve this?
>> >>
>> >> I also think the spec should be next to the definition of the URN,
>> >> that's one of the reason the URNs were originally in a markdown file
>> >> (to encourage good documentation, literate programming style). Many
>> >> coders already have their specs there.
>> >>
>> >> Regarding backwards compatibility, we can't change existing coders,
>> >> and making new coders won't help with inference ('cause changing that
>> >> would also be backwards incompatible). Fortunately, I think we're
>> >> already doing the consistent thing here: In both Python and Java the
>> >> raw UTF-8 encoded bytes are encoded when used in an *unnested* context
>> >> and the length-prefixed UTF-8 encoded bytes are used when the coder is
>> >> used in a *nested* context.
>> >>
>> >> I'd really like to see the whole nested/unnested context go away, but
>> >> that'll probably require Beam 3.0; it causes way more confusion than
>> >> the couple of bytes it saves in a couple of places.
>> >>
>> >> - Robert
>> >>
>> >> On Thu, Apr 4, 2019 at 10:55 PM Robert Burke <ro...@frantil.com> wrote:
>> >> >
>> >> > My 2cents is that the "Textual description" should be part of the documentation of the URNs on the Proto messages, since that's the common place. I've added a short description for the varints for example, and we already have lenghthier format & protocol descriptions there for iterables and similar.
>> >> >
>> >> > The proto [1] *can be* the spec if we want it to be.
>> >> >
>> >> > [1]: https://github.com/apache/beam/blob/069fc3de95bd96f34c363308ad9ba988ab58502d/model/pipeline/src/main/proto/beam_runner_api.proto#L557
>> >> >
>> >> > On Thu, 4 Apr 2019 at 13:51, Kenneth Knowles <ke...@apache.org> wrote:
>> >> >>
>> >> >>
>> >> >>
>> >> >> On Thu, Apr 4, 2019 at 1:49 PM Robert Burke <ro...@frantil.com> wrote:
>> >> >>>
>> >> >>> We should probably move the "java" version of the yaml file [1] to a common location rather than deep in the java hierarchy, or copying it for Go and Python, but that can be a separate task. It's probably non-trivial since it looks like it's part of a java resources structure.
>> >> >>
>> >> >>
>> >> >> Seems like /model is a good place for this if we don't want to invent a new language-independent hierarchy.
>> >> >>
>> >> >> Kenn
>> >> >>
>> >> >>
>> >> >>>
>> >> >>> Luke, the Go SDK doesn't currently do this validation, but it shouldn't be difficult, given pointers to the Java and Python variants of the tests to crib from [2]. Care would need to be taken so that Beam Go SDK users (such as they are) aren't forced to run them, and not have the yaml file to read. I'd suggest putting it with the integration tests [3].
>> >> >>>
>> >> >>> I've filed a JIRA (BEAM-7009) for tracking this Go SDK side work. [4]
>> >> >>>
>> >> >>> 1: https://github.com/apache/beam/blob/master/model/fn-execution/src/main/resources/org/apache/beam/model/fnexecution/v1/standard_coders.yaml
>> >> >>> 2: https://github.com/apache/beam/search?q=standard_coders.yaml&unscoped_q=standard_coders.yaml
>> >> >>> 3: https://github.com/apache/beam/tree/master/sdks/go/test
>> >> >>> 4: https://issues.apache.org/jira/browse/BEAM-7009
>> >> >>>
>> >> >>>
>> >> >>> On Thu, 4 Apr 2019 at 13:28, Lukasz Cwik <lc...@google.com> wrote:
>> >> >>>>
>> >> >>>>
>> >> >>>>
>> >> >>>> On Thu, Apr 4, 2019 at 1:15 PM Chamikara Jayalath <ch...@google.com> wrote:
>> >> >>>>>
>> >> >>>>>
>> >> >>>>>
>> >> >>>>> On Thu, Apr 4, 2019 at 12:15 PM Lukasz Cwik <lc...@google.com> wrote:
>> >> >>>>>>
>> >> >>>>>> standard_coders.yaml[1] is where we are currently defining these formats.
>> >> >>>>>> Unfortunately the Python SDK has its own copy[2].
>> >> >>>>>
>> >> >>>>>
>> >> >>>>> Ah great. Thanks for the pointer. Any idea why there's  a separate copy for Python ? I didn't see a significant difference in definitions looking at few random coders there but I might have missed something. If there's no reason to maintain two, we should probably unify.
>> >> >>>>> Also, seems like we haven't added the definition for UTF-8 coder yet.
>> >> >>>>>
>> >> >>>>
>> >> >>>> Not certain as well. I did notice the timer coder definition didn't exist in the Python copy.
>> >> >>>>
>> >> >>>>>>
>> >> >>>>>>
>> >> >>>>>> Here is an example PR[3] that adds the "beam:coder:double:v1" as tests to the Java and Python SDKs to ensure interoperability.
>> >> >>>>>>
>> >> >>>>>> Robert Burke, does the Go SDK have a test where it uses standard_coders.yaml and runs compatibility tests?
>> >> >>>>>>
>> >> >>>>>> Chamikara, creating new coder classes is a pain since the type -> coder mapping per SDK language would select the non-well known type if we added a new one to a language. If we swapped the default type->coder mapping, this would still break update for pipelines forcing users to update their code to select the non-well known type. If we don't change the default type->coder mapping, the well known coder will gain little usage. I think we should fix the Python coder to use the same encoding as Java for UTF-8 strings before there are too many Python SDK users.
>> >> >>>>>
>> >> >>>>>
>> >> >>>>> I was thinking that may be we should just change the default UTF-8 coder for Fn API path which is experimental. Updating Python to do what's done for Java is fine if we agree that encoding used for Java should be the standard.
>> >> >>>>>
>> >> >>>>
>> >> >>>> That is a good idea to use the Fn API experiment to control which gets selected.
>> >> >>>>
>> >> >>>>>>
>> >> >>>>>>
>> >> >>>>>> 1: https://github.com/apache/beam/blob/master/model/fn-execution/src/main/resources/org/apache/beam/model/fnexecution/v1/standard_coders.yaml
>> >> >>>>>> 2: https://github.com/apache/beam/blob/master/sdks/python/apache_beam/testing/data/standard_coders.yaml
>> >> >>>>>> 3: https://github.com/apache/beam/pull/8205
>> >> >>>>>>
>> >> >>>>>> On Thu, Apr 4, 2019 at 11:50 AM Chamikara Jayalath <ch...@google.com> wrote:
>> >> >>>>>>>
>> >> >>>>>>>
>> >> >>>>>>>
>> >> >>>>>>> On Thu, Apr 4, 2019 at 11:29 AM Robert Bradshaw <ro...@google.com> wrote:
>> >> >>>>>>>>
>> >> >>>>>>>> A URN defines the encoding.
>> >> >>>>>>>>
>> >> >>>>>>>> There are (unfortunately) *two* encodings defined for a Coder (defined
>> >> >>>>>>>> by a URN), the nested and the unnested one. IIRC, in both Java and
>> >> >>>>>>>> Python, the nested one prefixes with a var-int length, and the
>> >> >>>>>>>> unnested one does not.
>> >> >>>>>>>
>> >> >>>>>>>
>> >> >>>>>>> Could you clarify where we define the exact encoding ? I only see a URN for UTF-8 [1] while if you look at the implementations Java includes length in the encoding [1] while Python [1] does not.
>> >> >>>>>>>
>> >> >>>>>>> [1] https://github.com/apache/beam/blob/069fc3de95bd96f34c363308ad9ba988ab58502d/model/pipeline/src/main/proto/beam_runner_api.proto#L563
>> >> >>>>>>> [2] https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/coders/StringUtf8Coder.java#L50
>> >> >>>>>>> [3] https://github.com/apache/beam/blob/master/sdks/python/apache_beam/coders/coders.py#L321
>> >> >>>>>>>
>> >> >>>>>>>
>> >> >>>>>>>>
>> >> >>>>>>>>
>> >> >>>>>>>> We should define the spec clearly and have cross-language tests.
>> >> >>>>>>>
>> >> >>>>>>>
>> >> >>>>>>> +1
>> >> >>>>>>>
>> >> >>>>>>> Regarding backwards compatibility, I agree that we should probably not update existing coder classes. Probably we should just standardize the correct encoding (may be as a comment near corresponding URN in the beam_runner_api.proto ?) and create new coder classes as needed.
>> >> >>>>>>>
>> >> >>>>>>>>
>> >> >>>>>>>>
>> >> >>>>>>>> On Thu, Apr 4, 2019 at 8:13 PM Pablo Estrada <pa...@google.com> wrote:
>> >> >>>>>>>> >
>> >> >>>>>>>> > Could this be a backwards-incompatible change that would break pipelines from upgrading? If they have data in-flight in between operators, and we change the coder, they would break?
>> >> >>>>>>>> > I know very little about coders, but since nobody has mentioned it, I wanted to make sure we have it in mind.
>> >> >>>>>>>> > -P.
>> >> >>>>>>>> >
>> >> >>>>>>>> > On Wed, Apr 3, 2019 at 8:33 PM Kenneth Knowles <ke...@apache.org> wrote:
>> >> >>>>>>>> >>
>> >> >>>>>>>> >> Agree that a coder URN defines the encoding. I see that string UTF-8 was added to the proto enum, but it needs a written spec of the encoding. Ideally some test data that different languages can use to drive compliance testing.
>> >> >>>>>>>> >>
>> >> >>>>>>>> >> Kenn
>> >> >>>>>>>> >>
>> >> >>>>>>>> >> On Wed, Apr 3, 2019 at 6:21 PM Robert Burke <ro...@frantil.com> wrote:
>> >> >>>>>>>> >>>
>> >> >>>>>>>> >>> String UTF8 was recently added as a "standard coder " URN in the protos, but I don't think that developed beyond Java, so adding it to Python would be reasonable in my opinion.
>> >> >>>>>>>> >>>
>> >> >>>>>>>> >>> The Go SDK handles Strings as "custom coders" presently which for Go are always length prefixed (and reported to the Runner as LP+CustomCoder). It would be straight forward to add the correct handling for strings, as Go natively treats strings as UTF8.
>> >> >>>>>>>> >>>
>> >> >>>>>>>> >>>
>> >> >>>>>>>> >>> On Wed, Apr 3, 2019, 5:03 PM Heejong Lee <he...@google.com> wrote:
>> >> >>>>>>>> >>>>
>> >> >>>>>>>> >>>> Hi all,
>> >> >>>>>>>> >>>>
>> >> >>>>>>>> >>>> It looks like UTF-8 String Coder in Java and Python SDKs uses different encoding schemes. StringUtf8Coder in Java SDK puts the varint length of the input string before actual data bytes however StrUtf8Coder in Python SDK directly encodes the input string to bytes value. For the last few weeks, I've been testing and fixing cross-language IO transforms and this discrepancy is a major blocker for me. IMO, we should unify the encoding schemes of UTF8 strings across the different SDKs and make it a standard coder. Any thoughts?
>> >> >>>>>>>> >>>>
>> >> >>>>>>>> >>>> Thanks,

Re: [DISCUSS] change the encoding scheme of Python StrUtf8Coder

Posted by Lukasz Cwik <lc...@google.com>.

Also, as for the backwards compatibility discussion, I don't believe
non-portable jobs will be able to be upgraded to portable jobs and hence
may be a good time to make upgrade incompatible coder changes at that point
in time.

On Fri, Apr 5, 2019 at 1:44 PM Lukasz Cwik <lc...@google.com> wrote:

> Robert, I filed https://issues.apache.org/jira/browse/BEAM-7015 for
> removing the Python SDK copy of standard_coders.yaml and assigned it to you.
>
> On Fri, Apr 5, 2019 at 9:24 AM Kenneth Knowles <ke...@apache.org> wrote:
>
>> Nested and unnested contexts are two different encodings. Can we just
>> give them different URNs? We can even just express the length-prefixed
>> UTF-8 as a composition of the length-prefix URN and the UTF-8 URN.
>>
>> On Fri, Apr 5, 2019 at 12:38 AM Robert Bradshaw <ro...@google.com>
>> wrote:
>>
>>> On Fri, Apr 5, 2019 at 12:50 AM Heejong Lee <he...@google.com> wrote:
>>> >
>>> > Robert, does nested/unnested context work properly for Java?
>>>
>>> I believe so. It is similar to the bytes coder, that prefixes vs. not
>>> based on the context.
>>>
>>> > I can see that the Context is fixed to NESTED[1] and the encode method
>>> with the Context parameter is marked as deprecated[2].
>>> >
>>> > [1]:
>>> https://github.com/apache/beam/blob/0868e7544fd1e96db67ff5b9e70a67802c0f0c8e/sdks/java/core/src/main/java/org/apache/beam/sdk/coders/StringUtf8Coder.java#L68
>>> > [2]:
>>> https://github.com/apache/beam/blob/0868e7544fd1e96db67ff5b9e70a67802c0f0c8e/sdks/java/core/src/main/java/org/apache/beam/sdk/coders/Coder.java#L132
>>>
>>> That doesn't mean it's unused, e.g.
>>>
>>>
>>> https://github.com/apache/beam/blob/release-2.12.0/sdks/java/core/src/main/java/org/apache/beam/sdk/util/CoderUtils.java#L160
>>>
>>> https://github.com/apache/beam/blob/release-2.12.0/sdks/java/core/src/main/java/org/apache/beam/sdk/coders/LengthPrefixCoder.java#L64
>>>
>>> (and I'm sure there's others).
>>>
>>> > On Thu, Apr 4, 2019 at 3:25 PM Robert Bradshaw <ro...@google.com>
>>> wrote:
>>> >>
>>> >> I don't know why there are two separate copies of
>>> >> standard_coders.yaml--originally there was just one (though it did
>>> >> live in the Python directory). I'm guessing a copy was made rather
>>> >> than just pointing both to the new location, but that completely
>>> >> defeats the point. I can't seem to access JIRA right now; could
>>> >> someone file an issue to resolve this?
>>> >>
>>> >> I also think the spec should be next to the definition of the URN,
>>> >> that's one of the reason the URNs were originally in a markdown file
>>> >> (to encourage good documentation, literate programming style). Many
>>> >> coders already have their specs there.
>>> >>
>>> >> Regarding backwards compatibility, we can't change existing coders,
>>> >> and making new coders won't help with inference ('cause changing that
>>> >> would also be backwards incompatible). Fortunately, I think we're
>>> >> already doing the consistent thing here: In both Python and Java the
>>> >> raw UTF-8 encoded bytes are encoded when used in an *unnested* context
>>> >> and the length-prefixed UTF-8 encoded bytes are used when the coder is
>>> >> used in a *nested* context.
>>> >>
>>> >> I'd really like to see the whole nested/unnested context go away, but
>>> >> that'll probably require Beam 3.0; it causes way more confusion than
>>> >> the couple of bytes it saves in a couple of places.
>>> >>
>>> >> - Robert
>>> >>
>>> >> On Thu, Apr 4, 2019 at 10:55 PM Robert Burke <ro...@frantil.com>
>>> wrote:
>>> >> >
>>> >> > My 2cents is that the "Textual description" should be part of the
>>> documentation of the URNs on the Proto messages, since that's the common
>>> place. I've added a short description for the varints for example, and we
>>> already have lenghthier format & protocol descriptions there for iterables
>>> and similar.
>>> >> >
>>> >> > The proto [1] *can be* the spec if we want it to be.
>>> >> >
>>> >> > [1]:
>>> https://github.com/apache/beam/blob/069fc3de95bd96f34c363308ad9ba988ab58502d/model/pipeline/src/main/proto/beam_runner_api.proto#L557
>>> >> >
>>> >> > On Thu, 4 Apr 2019 at 13:51, Kenneth Knowles <ke...@apache.org>
>>> wrote:
>>> >> >>
>>> >> >>
>>> >> >>
>>> >> >> On Thu, Apr 4, 2019 at 1:49 PM Robert Burke <ro...@frantil.com>
>>> wrote:
>>> >> >>>
>>> >> >>> We should probably move the "java" version of the yaml file [1]
>>> to a common location rather than deep in the java hierarchy, or copying it
>>> for Go and Python, but that can be a separate task. It's probably
>>> non-trivial since it looks like it's part of a java resources structure.
>>> >> >>
>>> >> >>
>>> >> >> Seems like /model is a good place for this if we don't want to
>>> invent a new language-independent hierarchy.
>>> >> >>
>>> >> >> Kenn
>>> >> >>
>>> >> >>
>>> >> >>>
>>> >> >>> Luke, the Go SDK doesn't currently do this validation, but it
>>> shouldn't be difficult, given pointers to the Java and Python variants of
>>> the tests to crib from [2]. Care would need to be taken so that Beam Go SDK
>>> users (such as they are) aren't forced to run them, and not have the yaml
>>> file to read. I'd suggest putting it with the integration tests [3].
>>> >> >>>
>>> >> >>> I've filed a JIRA (BEAM-7009) for tracking this Go SDK side work.
>>> [4]
>>> >> >>>
>>> >> >>> 1:
>>> https://github.com/apache/beam/blob/master/model/fn-execution/src/main/resources/org/apache/beam/model/fnexecution/v1/standard_coders.yaml
>>> >> >>> 2:
>>> https://github.com/apache/beam/search?q=standard_coders.yaml&unscoped_q=standard_coders.yaml
>>> >> >>> 3: https://github.com/apache/beam/tree/master/sdks/go/test
>>> >> >>> 4: https://issues.apache.org/jira/browse/BEAM-7009
>>> >> >>>
>>> >> >>>
>>> >> >>> On Thu, 4 Apr 2019 at 13:28, Lukasz Cwik <lc...@google.com>
>>> wrote:
>>> >> >>>>
>>> >> >>>>
>>> >> >>>>
>>> >> >>>> On Thu, Apr 4, 2019 at 1:15 PM Chamikara Jayalath <
>>> chamikara@google.com> wrote:
>>> >> >>>>>
>>> >> >>>>>
>>> >> >>>>>
>>> >> >>>>> On Thu, Apr 4, 2019 at 12:15 PM Lukasz Cwik <lc...@google.com>
>>> wrote:
>>> >> >>>>>>
>>> >> >>>>>> standard_coders.yaml[1] is where we are currently defining
>>> these formats.
>>> >> >>>>>> Unfortunately the Python SDK has its own copy[2].
>>> >> >>>>>
>>> >> >>>>>
>>> >> >>>>> Ah great. Thanks for the pointer. Any idea why there's  a
>>> separate copy for Python ? I didn't see a significant difference in
>>> definitions looking at few random coders there but I might have missed
>>> something. If there's no reason to maintain two, we should probably unify.
>>> >> >>>>> Also, seems like we haven't added the definition for UTF-8
>>> coder yet.
>>> >> >>>>>
>>> >> >>>>
>>> >> >>>> Not certain as well. I did notice the timer coder definition
>>> didn't exist in the Python copy.
>>> >> >>>>
>>> >> >>>>>>
>>> >> >>>>>>
>>> >> >>>>>> Here is an example PR[3] that adds the "beam:coder:double:v1"
>>> as tests to the Java and Python SDKs to ensure interoperability.
>>> >> >>>>>>
>>> >> >>>>>> Robert Burke, does the Go SDK have a test where it uses
>>> standard_coders.yaml and runs compatibility tests?
>>> >> >>>>>>
>>> >> >>>>>> Chamikara, creating new coder classes is a pain since the type
>>> -> coder mapping per SDK language would select the non-well known type if
>>> we added a new one to a language. If we swapped the default type->coder
>>> mapping, this would still break update for pipelines forcing users to
>>> update their code to select the non-well known type. If we don't change the
>>> default type->coder mapping, the well known coder will gain little usage. I
>>> think we should fix the Python coder to use the same encoding as Java for
>>> UTF-8 strings before there are too many Python SDK users.
>>> >> >>>>>
>>> >> >>>>>
>>> >> >>>>> I was thinking that may be we should just change the default
>>> UTF-8 coder for Fn API path which is experimental. Updating Python to do
>>> what's done for Java is fine if we agree that encoding used for Java should
>>> be the standard.
>>> >> >>>>>
>>> >> >>>>
>>> >> >>>> That is a good idea to use the Fn API experiment to control
>>> which gets selected.
>>> >> >>>>
>>> >> >>>>>>
>>> >> >>>>>>
>>> >> >>>>>> 1:
>>> https://github.com/apache/beam/blob/master/model/fn-execution/src/main/resources/org/apache/beam/model/fnexecution/v1/standard_coders.yaml
>>> >> >>>>>> 2:
>>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/testing/data/standard_coders.yaml
>>> >> >>>>>> 3: https://github.com/apache/beam/pull/8205
>>> >> >>>>>>
>>> >> >>>>>> On Thu, Apr 4, 2019 at 11:50 AM Chamikara Jayalath <
>>> chamikara@google.com> wrote:
>>> >> >>>>>>>
>>> >> >>>>>>>
>>> >> >>>>>>>
>>> >> >>>>>>> On Thu, Apr 4, 2019 at 11:29 AM Robert Bradshaw <
>>> robertwb@google.com> wrote:
>>> >> >>>>>>>>
>>> >> >>>>>>>> A URN defines the encoding.
>>> >> >>>>>>>>
>>> >> >>>>>>>> There are (unfortunately) *two* encodings defined for a
>>> Coder (defined
>>> >> >>>>>>>> by a URN), the nested and the unnested one. IIRC, in both
>>> Java and
>>> >> >>>>>>>> Python, the nested one prefixes with a var-int length, and
>>> the
>>> >> >>>>>>>> unnested one does not.
>>> >> >>>>>>>
>>> >> >>>>>>>
>>> >> >>>>>>> Could you clarify where we define the exact encoding ? I only
>>> see a URN for UTF-8 [1] while if you look at the implementations Java
>>> includes length in the encoding [1] while Python [1] does not.
>>> >> >>>>>>>
>>> >> >>>>>>> [1]
>>> https://github.com/apache/beam/blob/069fc3de95bd96f34c363308ad9ba988ab58502d/model/pipeline/src/main/proto/beam_runner_api.proto#L563
>>> >> >>>>>>> [2]
>>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/coders/StringUtf8Coder.java#L50
>>> >> >>>>>>> [3]
>>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/coders/coders.py#L321
>>> >> >>>>>>>
>>> >> >>>>>>>
>>> >> >>>>>>>>
>>> >> >>>>>>>>
>>> >> >>>>>>>> We should define the spec clearly and have cross-language
>>> tests.
>>> >> >>>>>>>
>>> >> >>>>>>>
>>> >> >>>>>>> +1
>>> >> >>>>>>>
>>> >> >>>>>>> Regarding backwards compatibility, I agree that we should
>>> probably not update existing coder classes. Probably we should just
>>> standardize the correct encoding (may be as a comment near corresponding
>>> URN in the beam_runner_api.proto ?) and create new coder classes as needed.
>>> >> >>>>>>>
>>> >> >>>>>>>>
>>> >> >>>>>>>>
>>> >> >>>>>>>> On Thu, Apr 4, 2019 at 8:13 PM Pablo Estrada <
>>> pabloem@google.com> wrote:
>>> >> >>>>>>>> >
>>> >> >>>>>>>> > Could this be a backwards-incompatible change that would
>>> break pipelines from upgrading? If they have data in-flight in between
>>> operators, and we change the coder, they would break?
>>> >> >>>>>>>> > I know very little about coders, but since nobody has
>>> mentioned it, I wanted to make sure we have it in mind.
>>> >> >>>>>>>> > -P.
>>> >> >>>>>>>> >
>>> >> >>>>>>>> > On Wed, Apr 3, 2019 at 8:33 PM Kenneth Knowles <
>>> kenn@apache.org> wrote:
>>> >> >>>>>>>> >>
>>> >> >>>>>>>> >> Agree that a coder URN defines the encoding. I see that
>>> string UTF-8 was added to the proto enum, but it needs a written spec of
>>> the encoding. Ideally some test data that different languages can use to
>>> drive compliance testing.
>>> >> >>>>>>>> >>
>>> >> >>>>>>>> >> Kenn
>>> >> >>>>>>>> >>
>>> >> >>>>>>>> >> On Wed, Apr 3, 2019 at 6:21 PM Robert Burke <
>>> robert@frantil.com> wrote:
>>> >> >>>>>>>> >>>
>>> >> >>>>>>>> >>> String UTF8 was recently added as a "standard coder "
>>> URN in the protos, but I don't think that developed beyond Java, so adding
>>> it to Python would be reasonable in my opinion.
>>> >> >>>>>>>> >>>
>>> >> >>>>>>>> >>> The Go SDK handles Strings as "custom coders" presently
>>> which for Go are always length prefixed (and reported to the Runner as
>>> LP+CustomCoder). It would be straight forward to add the correct handling
>>> for strings, as Go natively treats strings as UTF8.
>>> >> >>>>>>>> >>>
>>> >> >>>>>>>> >>>
>>> >> >>>>>>>> >>> On Wed, Apr 3, 2019, 5:03 PM Heejong Lee <
>>> heejong@google.com> wrote:
>>> >> >>>>>>>> >>>>
>>> >> >>>>>>>> >>>> Hi all,
>>> >> >>>>>>>> >>>>
>>> >> >>>>>>>> >>>> It looks like UTF-8 String Coder in Java and Python
>>> SDKs uses different encoding schemes. StringUtf8Coder in Java SDK puts the
>>> varint length of the input string before actual data bytes however
>>> StrUtf8Coder in Python SDK directly encodes the input string to bytes
>>> value. For the last few weeks, I've been testing and fixing cross-language
>>> IO transforms and this discrepancy is a major blocker for me. IMO, we
>>> should unify the encoding schemes of UTF8 strings across the different SDKs
>>> and make it a standard coder. Any thoughts?
>>> >> >>>>>>>> >>>>
>>> >> >>>>>>>> >>>> Thanks,
>>>
>>

Re: [DISCUSS] change the encoding scheme of Python StrUtf8Coder

Posted by Lukasz Cwik <lc...@google.com>.

Robert, I filed https://issues.apache.org/jira/browse/BEAM-7015 for
removing the Python SDK copy of standard_coders.yaml and assigned it to you.

On Fri, Apr 5, 2019 at 9:24 AM Kenneth Knowles <ke...@apache.org> wrote:

> Nested and unnested contexts are two different encodings. Can we just give
> them different URNs? We can even just express the length-prefixed UTF-8 as
> a composition of the length-prefix URN and the UTF-8 URN.
>
> On Fri, Apr 5, 2019 at 12:38 AM Robert Bradshaw <ro...@google.com>
> wrote:
>
>> On Fri, Apr 5, 2019 at 12:50 AM Heejong Lee <he...@google.com> wrote:
>> >
>> > Robert, does nested/unnested context work properly for Java?
>>
>> I believe so. It is similar to the bytes coder, that prefixes vs. not
>> based on the context.
>>
>> > I can see that the Context is fixed to NESTED[1] and the encode method
>> with the Context parameter is marked as deprecated[2].
>> >
>> > [1]:
>> https://github.com/apache/beam/blob/0868e7544fd1e96db67ff5b9e70a67802c0f0c8e/sdks/java/core/src/main/java/org/apache/beam/sdk/coders/StringUtf8Coder.java#L68
>> > [2]:
>> https://github.com/apache/beam/blob/0868e7544fd1e96db67ff5b9e70a67802c0f0c8e/sdks/java/core/src/main/java/org/apache/beam/sdk/coders/Coder.java#L132
>>
>> That doesn't mean it's unused, e.g.
>>
>>
>> https://github.com/apache/beam/blob/release-2.12.0/sdks/java/core/src/main/java/org/apache/beam/sdk/util/CoderUtils.java#L160
>>
>> https://github.com/apache/beam/blob/release-2.12.0/sdks/java/core/src/main/java/org/apache/beam/sdk/coders/LengthPrefixCoder.java#L64
>>
>> (and I'm sure there's others).
>>
>> > On Thu, Apr 4, 2019 at 3:25 PM Robert Bradshaw <ro...@google.com>
>> wrote:
>> >>
>> >> I don't know why there are two separate copies of
>> >> standard_coders.yaml--originally there was just one (though it did
>> >> live in the Python directory). I'm guessing a copy was made rather
>> >> than just pointing both to the new location, but that completely
>> >> defeats the point. I can't seem to access JIRA right now; could
>> >> someone file an issue to resolve this?
>> >>
>> >> I also think the spec should be next to the definition of the URN,
>> >> that's one of the reason the URNs were originally in a markdown file
>> >> (to encourage good documentation, literate programming style). Many
>> >> coders already have their specs there.
>> >>
>> >> Regarding backwards compatibility, we can't change existing coders,
>> >> and making new coders won't help with inference ('cause changing that
>> >> would also be backwards incompatible). Fortunately, I think we're
>> >> already doing the consistent thing here: In both Python and Java the
>> >> raw UTF-8 encoded bytes are encoded when used in an *unnested* context
>> >> and the length-prefixed UTF-8 encoded bytes are used when the coder is
>> >> used in a *nested* context.
>> >>
>> >> I'd really like to see the whole nested/unnested context go away, but
>> >> that'll probably require Beam 3.0; it causes way more confusion than
>> >> the couple of bytes it saves in a couple of places.
>> >>
>> >> - Robert
>> >>
>> >> On Thu, Apr 4, 2019 at 10:55 PM Robert Burke <ro...@frantil.com>
>> wrote:
>> >> >
>> >> > My 2cents is that the "Textual description" should be part of the
>> documentation of the URNs on the Proto messages, since that's the common
>> place. I've added a short description for the varints for example, and we
>> already have lenghthier format & protocol descriptions there for iterables
>> and similar.
>> >> >
>> >> > The proto [1] *can be* the spec if we want it to be.
>> >> >
>> >> > [1]:
>> https://github.com/apache/beam/blob/069fc3de95bd96f34c363308ad9ba988ab58502d/model/pipeline/src/main/proto/beam_runner_api.proto#L557
>> >> >
>> >> > On Thu, 4 Apr 2019 at 13:51, Kenneth Knowles <ke...@apache.org>
>> wrote:
>> >> >>
>> >> >>
>> >> >>
>> >> >> On Thu, Apr 4, 2019 at 1:49 PM Robert Burke <ro...@frantil.com>
>> wrote:
>> >> >>>
>> >> >>> We should probably move the "java" version of the yaml file [1] to
>> a common location rather than deep in the java hierarchy, or copying it for
>> Go and Python, but that can be a separate task. It's probably non-trivial
>> since it looks like it's part of a java resources structure.
>> >> >>
>> >> >>
>> >> >> Seems like /model is a good place for this if we don't want to
>> invent a new language-independent hierarchy.
>> >> >>
>> >> >> Kenn
>> >> >>
>> >> >>
>> >> >>>
>> >> >>> Luke, the Go SDK doesn't currently do this validation, but it
>> shouldn't be difficult, given pointers to the Java and Python variants of
>> the tests to crib from [2]. Care would need to be taken so that Beam Go SDK
>> users (such as they are) aren't forced to run them, and not have the yaml
>> file to read. I'd suggest putting it with the integration tests [3].
>> >> >>>
>> >> >>> I've filed a JIRA (BEAM-7009) for tracking this Go SDK side work.
>> [4]
>> >> >>>
>> >> >>> 1:
>> https://github.com/apache/beam/blob/master/model/fn-execution/src/main/resources/org/apache/beam/model/fnexecution/v1/standard_coders.yaml
>> >> >>> 2:
>> https://github.com/apache/beam/search?q=standard_coders.yaml&unscoped_q=standard_coders.yaml
>> >> >>> 3: https://github.com/apache/beam/tree/master/sdks/go/test
>> >> >>> 4: https://issues.apache.org/jira/browse/BEAM-7009
>> >> >>>
>> >> >>>
>> >> >>> On Thu, 4 Apr 2019 at 13:28, Lukasz Cwik <lc...@google.com> wrote:
>> >> >>>>
>> >> >>>>
>> >> >>>>
>> >> >>>> On Thu, Apr 4, 2019 at 1:15 PM Chamikara Jayalath <
>> chamikara@google.com> wrote:
>> >> >>>>>
>> >> >>>>>
>> >> >>>>>
>> >> >>>>> On Thu, Apr 4, 2019 at 12:15 PM Lukasz Cwik <lc...@google.com>
>> wrote:
>> >> >>>>>>
>> >> >>>>>> standard_coders.yaml[1] is where we are currently defining
>> these formats.
>> >> >>>>>> Unfortunately the Python SDK has its own copy[2].
>> >> >>>>>
>> >> >>>>>
>> >> >>>>> Ah great. Thanks for the pointer. Any idea why there's  a
>> separate copy for Python ? I didn't see a significant difference in
>> definitions looking at few random coders there but I might have missed
>> something. If there's no reason to maintain two, we should probably unify.
>> >> >>>>> Also, seems like we haven't added the definition for UTF-8 coder
>> yet.
>> >> >>>>>
>> >> >>>>
>> >> >>>> Not certain as well. I did notice the timer coder definition
>> didn't exist in the Python copy.
>> >> >>>>
>> >> >>>>>>
>> >> >>>>>>
>> >> >>>>>> Here is an example PR[3] that adds the "beam:coder:double:v1"
>> as tests to the Java and Python SDKs to ensure interoperability.
>> >> >>>>>>
>> >> >>>>>> Robert Burke, does the Go SDK have a test where it uses
>> standard_coders.yaml and runs compatibility tests?
>> >> >>>>>>
>> >> >>>>>> Chamikara, creating new coder classes is a pain since the type
>> -> coder mapping per SDK language would select the non-well known type if
>> we added a new one to a language. If we swapped the default type->coder
>> mapping, this would still break update for pipelines forcing users to
>> update their code to select the non-well known type. If we don't change the
>> default type->coder mapping, the well known coder will gain little usage. I
>> think we should fix the Python coder to use the same encoding as Java for
>> UTF-8 strings before there are too many Python SDK users.
>> >> >>>>>
>> >> >>>>>
>> >> >>>>> I was thinking that may be we should just change the default
>> UTF-8 coder for Fn API path which is experimental. Updating Python to do
>> what's done for Java is fine if we agree that encoding used for Java should
>> be the standard.
>> >> >>>>>
>> >> >>>>
>> >> >>>> That is a good idea to use the Fn API experiment to control which
>> gets selected.
>> >> >>>>
>> >> >>>>>>
>> >> >>>>>>
>> >> >>>>>> 1:
>> https://github.com/apache/beam/blob/master/model/fn-execution/src/main/resources/org/apache/beam/model/fnexecution/v1/standard_coders.yaml
>> >> >>>>>> 2:
>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/testing/data/standard_coders.yaml
>> >> >>>>>> 3: https://github.com/apache/beam/pull/8205
>> >> >>>>>>
>> >> >>>>>> On Thu, Apr 4, 2019 at 11:50 AM Chamikara Jayalath <
>> chamikara@google.com> wrote:
>> >> >>>>>>>
>> >> >>>>>>>
>> >> >>>>>>>
>> >> >>>>>>> On Thu, Apr 4, 2019 at 11:29 AM Robert Bradshaw <
>> robertwb@google.com> wrote:
>> >> >>>>>>>>
>> >> >>>>>>>> A URN defines the encoding.
>> >> >>>>>>>>
>> >> >>>>>>>> There are (unfortunately) *two* encodings defined for a Coder
>> (defined
>> >> >>>>>>>> by a URN), the nested and the unnested one. IIRC, in both
>> Java and
>> >> >>>>>>>> Python, the nested one prefixes with a var-int length, and the
>> >> >>>>>>>> unnested one does not.
>> >> >>>>>>>
>> >> >>>>>>>
>> >> >>>>>>> Could you clarify where we define the exact encoding ? I only
>> see a URN for UTF-8 [1] while if you look at the implementations Java
>> includes length in the encoding [1] while Python [1] does not.
>> >> >>>>>>>
>> >> >>>>>>> [1]
>> https://github.com/apache/beam/blob/069fc3de95bd96f34c363308ad9ba988ab58502d/model/pipeline/src/main/proto/beam_runner_api.proto#L563
>> >> >>>>>>> [2]
>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/coders/StringUtf8Coder.java#L50
>> >> >>>>>>> [3]
>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/coders/coders.py#L321
>> >> >>>>>>>
>> >> >>>>>>>
>> >> >>>>>>>>
>> >> >>>>>>>>
>> >> >>>>>>>> We should define the spec clearly and have cross-language
>> tests.
>> >> >>>>>>>
>> >> >>>>>>>
>> >> >>>>>>> +1
>> >> >>>>>>>
>> >> >>>>>>> Regarding backwards compatibility, I agree that we should
>> probably not update existing coder classes. Probably we should just
>> standardize the correct encoding (may be as a comment near corresponding
>> URN in the beam_runner_api.proto ?) and create new coder classes as needed.
>> >> >>>>>>>
>> >> >>>>>>>>
>> >> >>>>>>>>
>> >> >>>>>>>> On Thu, Apr 4, 2019 at 8:13 PM Pablo Estrada <
>> pabloem@google.com> wrote:
>> >> >>>>>>>> >
>> >> >>>>>>>> > Could this be a backwards-incompatible change that would
>> break pipelines from upgrading? If they have data in-flight in between
>> operators, and we change the coder, they would break?
>> >> >>>>>>>> > I know very little about coders, but since nobody has
>> mentioned it, I wanted to make sure we have it in mind.
>> >> >>>>>>>> > -P.
>> >> >>>>>>>> >
>> >> >>>>>>>> > On Wed, Apr 3, 2019 at 8:33 PM Kenneth Knowles <
>> kenn@apache.org> wrote:
>> >> >>>>>>>> >>
>> >> >>>>>>>> >> Agree that a coder URN defines the encoding. I see that
>> string UTF-8 was added to the proto enum, but it needs a written spec of
>> the encoding. Ideally some test data that different languages can use to
>> drive compliance testing.
>> >> >>>>>>>> >>
>> >> >>>>>>>> >> Kenn
>> >> >>>>>>>> >>
>> >> >>>>>>>> >> On Wed, Apr 3, 2019 at 6:21 PM Robert Burke <
>> robert@frantil.com> wrote:
>> >> >>>>>>>> >>>
>> >> >>>>>>>> >>> String UTF8 was recently added as a "standard coder " URN
>> in the protos, but I don't think that developed beyond Java, so adding it
>> to Python would be reasonable in my opinion.
>> >> >>>>>>>> >>>
>> >> >>>>>>>> >>> The Go SDK handles Strings as "custom coders" presently
>> which for Go are always length prefixed (and reported to the Runner as
>> LP+CustomCoder). It would be straight forward to add the correct handling
>> for strings, as Go natively treats strings as UTF8.
>> >> >>>>>>>> >>>
>> >> >>>>>>>> >>>
>> >> >>>>>>>> >>> On Wed, Apr 3, 2019, 5:03 PM Heejong Lee <
>> heejong@google.com> wrote:
>> >> >>>>>>>> >>>>
>> >> >>>>>>>> >>>> Hi all,
>> >> >>>>>>>> >>>>
>> >> >>>>>>>> >>>> It looks like UTF-8 String Coder in Java and Python SDKs
>> uses different encoding schemes. StringUtf8Coder in Java SDK puts the
>> varint length of the input string before actual data bytes however
>> StrUtf8Coder in Python SDK directly encodes the input string to bytes
>> value. For the last few weeks, I've been testing and fixing cross-language
>> IO transforms and this discrepancy is a major blocker for me. IMO, we
>> should unify the encoding schemes of UTF8 strings across the different SDKs
>> and make it a standard coder. Any thoughts?
>> >> >>>>>>>> >>>>
>> >> >>>>>>>> >>>> Thanks,
>>
>

Re: [DISCUSS] change the encoding scheme of Python StrUtf8Coder

Posted by Kenneth Knowles <ke...@apache.org>.

Nested and unnested contexts are two different encodings. Can we just give
them different URNs? We can even just express the length-prefixed UTF-8 as
a composition of the length-prefix URN and the UTF-8 URN.

On Fri, Apr 5, 2019 at 12:38 AM Robert Bradshaw <ro...@google.com> wrote:

> On Fri, Apr 5, 2019 at 12:50 AM Heejong Lee <he...@google.com> wrote:
> >
> > Robert, does nested/unnested context work properly for Java?
>
> I believe so. It is similar to the bytes coder, that prefixes vs. not
> based on the context.
>
> > I can see that the Context is fixed to NESTED[1] and the encode method
> with the Context parameter is marked as deprecated[2].
> >
> > [1]:
> https://github.com/apache/beam/blob/0868e7544fd1e96db67ff5b9e70a67802c0f0c8e/sdks/java/core/src/main/java/org/apache/beam/sdk/coders/StringUtf8Coder.java#L68
> > [2]:
> https://github.com/apache/beam/blob/0868e7544fd1e96db67ff5b9e70a67802c0f0c8e/sdks/java/core/src/main/java/org/apache/beam/sdk/coders/Coder.java#L132
>
> That doesn't mean it's unused, e.g.
>
>
> https://github.com/apache/beam/blob/release-2.12.0/sdks/java/core/src/main/java/org/apache/beam/sdk/util/CoderUtils.java#L160
>
> https://github.com/apache/beam/blob/release-2.12.0/sdks/java/core/src/main/java/org/apache/beam/sdk/coders/LengthPrefixCoder.java#L64
>
> (and I'm sure there's others).
>
> > On Thu, Apr 4, 2019 at 3:25 PM Robert Bradshaw <ro...@google.com>
> wrote:
> >>
> >> I don't know why there are two separate copies of
> >> standard_coders.yaml--originally there was just one (though it did
> >> live in the Python directory). I'm guessing a copy was made rather
> >> than just pointing both to the new location, but that completely
> >> defeats the point. I can't seem to access JIRA right now; could
> >> someone file an issue to resolve this?
> >>
> >> I also think the spec should be next to the definition of the URN,
> >> that's one of the reason the URNs were originally in a markdown file
> >> (to encourage good documentation, literate programming style). Many
> >> coders already have their specs there.
> >>
> >> Regarding backwards compatibility, we can't change existing coders,
> >> and making new coders won't help with inference ('cause changing that
> >> would also be backwards incompatible). Fortunately, I think we're
> >> already doing the consistent thing here: In both Python and Java the
> >> raw UTF-8 encoded bytes are encoded when used in an *unnested* context
> >> and the length-prefixed UTF-8 encoded bytes are used when the coder is
> >> used in a *nested* context.
> >>
> >> I'd really like to see the whole nested/unnested context go away, but
> >> that'll probably require Beam 3.0; it causes way more confusion than
> >> the couple of bytes it saves in a couple of places.
> >>
> >> - Robert
> >>
> >> On Thu, Apr 4, 2019 at 10:55 PM Robert Burke <ro...@frantil.com>
> wrote:
> >> >
> >> > My 2cents is that the "Textual description" should be part of the
> documentation of the URNs on the Proto messages, since that's the common
> place. I've added a short description for the varints for example, and we
> already have lenghthier format & protocol descriptions there for iterables
> and similar.
> >> >
> >> > The proto [1] *can be* the spec if we want it to be.
> >> >
> >> > [1]:
> https://github.com/apache/beam/blob/069fc3de95bd96f34c363308ad9ba988ab58502d/model/pipeline/src/main/proto/beam_runner_api.proto#L557
> >> >
> >> > On Thu, 4 Apr 2019 at 13:51, Kenneth Knowles <ke...@apache.org> wrote:
> >> >>
> >> >>
> >> >>
> >> >> On Thu, Apr 4, 2019 at 1:49 PM Robert Burke <ro...@frantil.com>
> wrote:
> >> >>>
> >> >>> We should probably move the "java" version of the yaml file [1] to
> a common location rather than deep in the java hierarchy, or copying it for
> Go and Python, but that can be a separate task. It's probably non-trivial
> since it looks like it's part of a java resources structure.
> >> >>
> >> >>
> >> >> Seems like /model is a good place for this if we don't want to
> invent a new language-independent hierarchy.
> >> >>
> >> >> Kenn
> >> >>
> >> >>
> >> >>>
> >> >>> Luke, the Go SDK doesn't currently do this validation, but it
> shouldn't be difficult, given pointers to the Java and Python variants of
> the tests to crib from [2]. Care would need to be taken so that Beam Go SDK
> users (such as they are) aren't forced to run them, and not have the yaml
> file to read. I'd suggest putting it with the integration tests [3].
> >> >>>
> >> >>> I've filed a JIRA (BEAM-7009) for tracking this Go SDK side work.
> [4]
> >> >>>
> >> >>> 1:
> https://github.com/apache/beam/blob/master/model/fn-execution/src/main/resources/org/apache/beam/model/fnexecution/v1/standard_coders.yaml
> >> >>> 2:
> https://github.com/apache/beam/search?q=standard_coders.yaml&unscoped_q=standard_coders.yaml
> >> >>> 3: https://github.com/apache/beam/tree/master/sdks/go/test
> >> >>> 4: https://issues.apache.org/jira/browse/BEAM-7009
> >> >>>
> >> >>>
> >> >>> On Thu, 4 Apr 2019 at 13:28, Lukasz Cwik <lc...@google.com> wrote:
> >> >>>>
> >> >>>>
> >> >>>>
> >> >>>> On Thu, Apr 4, 2019 at 1:15 PM Chamikara Jayalath <
> chamikara@google.com> wrote:
> >> >>>>>
> >> >>>>>
> >> >>>>>
> >> >>>>> On Thu, Apr 4, 2019 at 12:15 PM Lukasz Cwik <lc...@google.com>
> wrote:
> >> >>>>>>
> >> >>>>>> standard_coders.yaml[1] is where we are currently defining these
> formats.
> >> >>>>>> Unfortunately the Python SDK has its own copy[2].
> >> >>>>>
> >> >>>>>
> >> >>>>> Ah great. Thanks for the pointer. Any idea why there's  a
> separate copy for Python ? I didn't see a significant difference in
> definitions looking at few random coders there but I might have missed
> something. If there's no reason to maintain two, we should probably unify.
> >> >>>>> Also, seems like we haven't added the definition for UTF-8 coder
> yet.
> >> >>>>>
> >> >>>>
> >> >>>> Not certain as well. I did notice the timer coder definition
> didn't exist in the Python copy.
> >> >>>>
> >> >>>>>>
> >> >>>>>>
> >> >>>>>> Here is an example PR[3] that adds the "beam:coder:double:v1" as
> tests to the Java and Python SDKs to ensure interoperability.
> >> >>>>>>
> >> >>>>>> Robert Burke, does the Go SDK have a test where it uses
> standard_coders.yaml and runs compatibility tests?
> >> >>>>>>
> >> >>>>>> Chamikara, creating new coder classes is a pain since the type
> -> coder mapping per SDK language would select the non-well known type if
> we added a new one to a language. If we swapped the default type->coder
> mapping, this would still break update for pipelines forcing users to
> update their code to select the non-well known type. If we don't change the
> default type->coder mapping, the well known coder will gain little usage. I
> think we should fix the Python coder to use the same encoding as Java for
> UTF-8 strings before there are too many Python SDK users.
> >> >>>>>
> >> >>>>>
> >> >>>>> I was thinking that may be we should just change the default
> UTF-8 coder for Fn API path which is experimental. Updating Python to do
> what's done for Java is fine if we agree that encoding used for Java should
> be the standard.
> >> >>>>>
> >> >>>>
> >> >>>> That is a good idea to use the Fn API experiment to control which
> gets selected.
> >> >>>>
> >> >>>>>>
> >> >>>>>>
> >> >>>>>> 1:
> https://github.com/apache/beam/blob/master/model/fn-execution/src/main/resources/org/apache/beam/model/fnexecution/v1/standard_coders.yaml
> >> >>>>>> 2:
> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/testing/data/standard_coders.yaml
> >> >>>>>> 3: https://github.com/apache/beam/pull/8205
> >> >>>>>>
> >> >>>>>> On Thu, Apr 4, 2019 at 11:50 AM Chamikara Jayalath <
> chamikara@google.com> wrote:
> >> >>>>>>>
> >> >>>>>>>
> >> >>>>>>>
> >> >>>>>>> On Thu, Apr 4, 2019 at 11:29 AM Robert Bradshaw <
> robertwb@google.com> wrote:
> >> >>>>>>>>
> >> >>>>>>>> A URN defines the encoding.
> >> >>>>>>>>
> >> >>>>>>>> There are (unfortunately) *two* encodings defined for a Coder
> (defined
> >> >>>>>>>> by a URN), the nested and the unnested one. IIRC, in both Java
> and
> >> >>>>>>>> Python, the nested one prefixes with a var-int length, and the
> >> >>>>>>>> unnested one does not.
> >> >>>>>>>
> >> >>>>>>>
> >> >>>>>>> Could you clarify where we define the exact encoding ? I only
> see a URN for UTF-8 [1] while if you look at the implementations Java
> includes length in the encoding [1] while Python [1] does not.
> >> >>>>>>>
> >> >>>>>>> [1]
> https://github.com/apache/beam/blob/069fc3de95bd96f34c363308ad9ba988ab58502d/model/pipeline/src/main/proto/beam_runner_api.proto#L563
> >> >>>>>>> [2]
> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/coders/StringUtf8Coder.java#L50
> >> >>>>>>> [3]
> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/coders/coders.py#L321
> >> >>>>>>>
> >> >>>>>>>
> >> >>>>>>>>
> >> >>>>>>>>
> >> >>>>>>>> We should define the spec clearly and have cross-language
> tests.
> >> >>>>>>>
> >> >>>>>>>
> >> >>>>>>> +1
> >> >>>>>>>
> >> >>>>>>> Regarding backwards compatibility, I agree that we should
> probably not update existing coder classes. Probably we should just
> standardize the correct encoding (may be as a comment near corresponding
> URN in the beam_runner_api.proto ?) and create new coder classes as needed.
> >> >>>>>>>
> >> >>>>>>>>
> >> >>>>>>>>
> >> >>>>>>>> On Thu, Apr 4, 2019 at 8:13 PM Pablo Estrada <
> pabloem@google.com> wrote:
> >> >>>>>>>> >
> >> >>>>>>>> > Could this be a backwards-incompatible change that would
> break pipelines from upgrading? If they have data in-flight in between
> operators, and we change the coder, they would break?
> >> >>>>>>>> > I know very little about coders, but since nobody has
> mentioned it, I wanted to make sure we have it in mind.
> >> >>>>>>>> > -P.
> >> >>>>>>>> >
> >> >>>>>>>> > On Wed, Apr 3, 2019 at 8:33 PM Kenneth Knowles <
> kenn@apache.org> wrote:
> >> >>>>>>>> >>
> >> >>>>>>>> >> Agree that a coder URN defines the encoding. I see that
> string UTF-8 was added to the proto enum, but it needs a written spec of
> the encoding. Ideally some test data that different languages can use to
> drive compliance testing.
> >> >>>>>>>> >>
> >> >>>>>>>> >> Kenn
> >> >>>>>>>> >>
> >> >>>>>>>> >> On Wed, Apr 3, 2019 at 6:21 PM Robert Burke <
> robert@frantil.com> wrote:
> >> >>>>>>>> >>>
> >> >>>>>>>> >>> String UTF8 was recently added as a "standard coder " URN
> in the protos, but I don't think that developed beyond Java, so adding it
> to Python would be reasonable in my opinion.
> >> >>>>>>>> >>>
> >> >>>>>>>> >>> The Go SDK handles Strings as "custom coders" presently
> which for Go are always length prefixed (and reported to the Runner as
> LP+CustomCoder). It would be straight forward to add the correct handling
> for strings, as Go natively treats strings as UTF8.
> >> >>>>>>>> >>>
> >> >>>>>>>> >>>
> >> >>>>>>>> >>> On Wed, Apr 3, 2019, 5:03 PM Heejong Lee <
> heejong@google.com> wrote:
> >> >>>>>>>> >>>>
> >> >>>>>>>> >>>> Hi all,
> >> >>>>>>>> >>>>
> >> >>>>>>>> >>>> It looks like UTF-8 String Coder in Java and Python SDKs
> uses different encoding schemes. StringUtf8Coder in Java SDK puts the
> varint length of the input string before actual data bytes however
> StrUtf8Coder in Python SDK directly encodes the input string to bytes
> value. For the last few weeks, I've been testing and fixing cross-language
> IO transforms and this discrepancy is a major blocker for me. IMO, we
> should unify the encoding schemes of UTF8 strings across the different SDKs
> and make it a standard coder. Any thoughts?
> >> >>>>>>>> >>>>
> >> >>>>>>>> >>>> Thanks,
>

Re: [DISCUSS] change the encoding scheme of Python StrUtf8Coder

Posted by Robert Bradshaw <ro...@google.com>.

On Fri, Apr 5, 2019 at 12:50 AM Heejong Lee <he...@google.com> wrote:
>
> Robert, does nested/unnested context work properly for Java?

I believe so. It is similar to the bytes coder, that prefixes vs. not
based on the context.

> I can see that the Context is fixed to NESTED[1] and the encode method with the Context parameter is marked as deprecated[2].
>
> [1]: https://github.com/apache/beam/blob/0868e7544fd1e96db67ff5b9e70a67802c0f0c8e/sdks/java/core/src/main/java/org/apache/beam/sdk/coders/StringUtf8Coder.java#L68
> [2]: https://github.com/apache/beam/blob/0868e7544fd1e96db67ff5b9e70a67802c0f0c8e/sdks/java/core/src/main/java/org/apache/beam/sdk/coders/Coder.java#L132

That doesn't mean it's unused, e.g.

https://github.com/apache/beam/blob/release-2.12.0/sdks/java/core/src/main/java/org/apache/beam/sdk/util/CoderUtils.java#L160
https://github.com/apache/beam/blob/release-2.12.0/sdks/java/core/src/main/java/org/apache/beam/sdk/coders/LengthPrefixCoder.java#L64

(and I'm sure there's others).

> On Thu, Apr 4, 2019 at 3:25 PM Robert Bradshaw <ro...@google.com> wrote:
>>
>> I don't know why there are two separate copies of
>> standard_coders.yaml--originally there was just one (though it did
>> live in the Python directory). I'm guessing a copy was made rather
>> than just pointing both to the new location, but that completely
>> defeats the point. I can't seem to access JIRA right now; could
>> someone file an issue to resolve this?
>>
>> I also think the spec should be next to the definition of the URN,
>> that's one of the reason the URNs were originally in a markdown file
>> (to encourage good documentation, literate programming style). Many
>> coders already have their specs there.
>>
>> Regarding backwards compatibility, we can't change existing coders,
>> and making new coders won't help with inference ('cause changing that
>> would also be backwards incompatible). Fortunately, I think we're
>> already doing the consistent thing here: In both Python and Java the
>> raw UTF-8 encoded bytes are encoded when used in an *unnested* context
>> and the length-prefixed UTF-8 encoded bytes are used when the coder is
>> used in a *nested* context.
>>
>> I'd really like to see the whole nested/unnested context go away, but
>> that'll probably require Beam 3.0; it causes way more confusion than
>> the couple of bytes it saves in a couple of places.
>>
>> - Robert
>>
>> On Thu, Apr 4, 2019 at 10:55 PM Robert Burke <ro...@frantil.com> wrote:
>> >
>> > My 2cents is that the "Textual description" should be part of the documentation of the URNs on the Proto messages, since that's the common place. I've added a short description for the varints for example, and we already have lenghthier format & protocol descriptions there for iterables and similar.
>> >
>> > The proto [1] *can be* the spec if we want it to be.
>> >
>> > [1]: https://github.com/apache/beam/blob/069fc3de95bd96f34c363308ad9ba988ab58502d/model/pipeline/src/main/proto/beam_runner_api.proto#L557
>> >
>> > On Thu, 4 Apr 2019 at 13:51, Kenneth Knowles <ke...@apache.org> wrote:
>> >>
>> >>
>> >>
>> >> On Thu, Apr 4, 2019 at 1:49 PM Robert Burke <ro...@frantil.com> wrote:
>> >>>
>> >>> We should probably move the "java" version of the yaml file [1] to a common location rather than deep in the java hierarchy, or copying it for Go and Python, but that can be a separate task. It's probably non-trivial since it looks like it's part of a java resources structure.
>> >>
>> >>
>> >> Seems like /model is a good place for this if we don't want to invent a new language-independent hierarchy.
>> >>
>> >> Kenn
>> >>
>> >>
>> >>>
>> >>> Luke, the Go SDK doesn't currently do this validation, but it shouldn't be difficult, given pointers to the Java and Python variants of the tests to crib from [2]. Care would need to be taken so that Beam Go SDK users (such as they are) aren't forced to run them, and not have the yaml file to read. I'd suggest putting it with the integration tests [3].
>> >>>
>> >>> I've filed a JIRA (BEAM-7009) for tracking this Go SDK side work. [4]
>> >>>
>> >>> 1: https://github.com/apache/beam/blob/master/model/fn-execution/src/main/resources/org/apache/beam/model/fnexecution/v1/standard_coders.yaml
>> >>> 2: https://github.com/apache/beam/search?q=standard_coders.yaml&unscoped_q=standard_coders.yaml
>> >>> 3: https://github.com/apache/beam/tree/master/sdks/go/test
>> >>> 4: https://issues.apache.org/jira/browse/BEAM-7009
>> >>>
>> >>>
>> >>> On Thu, 4 Apr 2019 at 13:28, Lukasz Cwik <lc...@google.com> wrote:
>> >>>>
>> >>>>
>> >>>>
>> >>>> On Thu, Apr 4, 2019 at 1:15 PM Chamikara Jayalath <ch...@google.com> wrote:
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>> On Thu, Apr 4, 2019 at 12:15 PM Lukasz Cwik <lc...@google.com> wrote:
>> >>>>>>
>> >>>>>> standard_coders.yaml[1] is where we are currently defining these formats.
>> >>>>>> Unfortunately the Python SDK has its own copy[2].
>> >>>>>
>> >>>>>
>> >>>>> Ah great. Thanks for the pointer. Any idea why there's  a separate copy for Python ? I didn't see a significant difference in definitions looking at few random coders there but I might have missed something. If there's no reason to maintain two, we should probably unify.
>> >>>>> Also, seems like we haven't added the definition for UTF-8 coder yet.
>> >>>>>
>> >>>>
>> >>>> Not certain as well. I did notice the timer coder definition didn't exist in the Python copy.
>> >>>>
>> >>>>>>
>> >>>>>>
>> >>>>>> Here is an example PR[3] that adds the "beam:coder:double:v1" as tests to the Java and Python SDKs to ensure interoperability.
>> >>>>>>
>> >>>>>> Robert Burke, does the Go SDK have a test where it uses standard_coders.yaml and runs compatibility tests?
>> >>>>>>
>> >>>>>> Chamikara, creating new coder classes is a pain since the type -> coder mapping per SDK language would select the non-well known type if we added a new one to a language. If we swapped the default type->coder mapping, this would still break update for pipelines forcing users to update their code to select the non-well known type. If we don't change the default type->coder mapping, the well known coder will gain little usage. I think we should fix the Python coder to use the same encoding as Java for UTF-8 strings before there are too many Python SDK users.
>> >>>>>
>> >>>>>
>> >>>>> I was thinking that may be we should just change the default UTF-8 coder for Fn API path which is experimental. Updating Python to do what's done for Java is fine if we agree that encoding used for Java should be the standard.
>> >>>>>
>> >>>>
>> >>>> That is a good idea to use the Fn API experiment to control which gets selected.
>> >>>>
>> >>>>>>
>> >>>>>>
>> >>>>>> 1: https://github.com/apache/beam/blob/master/model/fn-execution/src/main/resources/org/apache/beam/model/fnexecution/v1/standard_coders.yaml
>> >>>>>> 2: https://github.com/apache/beam/blob/master/sdks/python/apache_beam/testing/data/standard_coders.yaml
>> >>>>>> 3: https://github.com/apache/beam/pull/8205
>> >>>>>>
>> >>>>>> On Thu, Apr 4, 2019 at 11:50 AM Chamikara Jayalath <ch...@google.com> wrote:
>> >>>>>>>
>> >>>>>>>
>> >>>>>>>
>> >>>>>>> On Thu, Apr 4, 2019 at 11:29 AM Robert Bradshaw <ro...@google.com> wrote:
>> >>>>>>>>
>> >>>>>>>> A URN defines the encoding.
>> >>>>>>>>
>> >>>>>>>> There are (unfortunately) *two* encodings defined for a Coder (defined
>> >>>>>>>> by a URN), the nested and the unnested one. IIRC, in both Java and
>> >>>>>>>> Python, the nested one prefixes with a var-int length, and the
>> >>>>>>>> unnested one does not.
>> >>>>>>>
>> >>>>>>>
>> >>>>>>> Could you clarify where we define the exact encoding ? I only see a URN for UTF-8 [1] while if you look at the implementations Java includes length in the encoding [1] while Python [1] does not.
>> >>>>>>>
>> >>>>>>> [1] https://github.com/apache/beam/blob/069fc3de95bd96f34c363308ad9ba988ab58502d/model/pipeline/src/main/proto/beam_runner_api.proto#L563
>> >>>>>>> [2] https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/coders/StringUtf8Coder.java#L50
>> >>>>>>> [3] https://github.com/apache/beam/blob/master/sdks/python/apache_beam/coders/coders.py#L321
>> >>>>>>>
>> >>>>>>>
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>> We should define the spec clearly and have cross-language tests.
>> >>>>>>>
>> >>>>>>>
>> >>>>>>> +1
>> >>>>>>>
>> >>>>>>> Regarding backwards compatibility, I agree that we should probably not update existing coder classes. Probably we should just standardize the correct encoding (may be as a comment near corresponding URN in the beam_runner_api.proto ?) and create new coder classes as needed.
>> >>>>>>>
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>> On Thu, Apr 4, 2019 at 8:13 PM Pablo Estrada <pa...@google.com> wrote:
>> >>>>>>>> >
>> >>>>>>>> > Could this be a backwards-incompatible change that would break pipelines from upgrading? If they have data in-flight in between operators, and we change the coder, they would break?
>> >>>>>>>> > I know very little about coders, but since nobody has mentioned it, I wanted to make sure we have it in mind.
>> >>>>>>>> > -P.
>> >>>>>>>> >
>> >>>>>>>> > On Wed, Apr 3, 2019 at 8:33 PM Kenneth Knowles <ke...@apache.org> wrote:
>> >>>>>>>> >>
>> >>>>>>>> >> Agree that a coder URN defines the encoding. I see that string UTF-8 was added to the proto enum, but it needs a written spec of the encoding. Ideally some test data that different languages can use to drive compliance testing.
>> >>>>>>>> >>
>> >>>>>>>> >> Kenn
>> >>>>>>>> >>
>> >>>>>>>> >> On Wed, Apr 3, 2019 at 6:21 PM Robert Burke <ro...@frantil.com> wrote:
>> >>>>>>>> >>>
>> >>>>>>>> >>> String UTF8 was recently added as a "standard coder " URN in the protos, but I don't think that developed beyond Java, so adding it to Python would be reasonable in my opinion.
>> >>>>>>>> >>>
>> >>>>>>>> >>> The Go SDK handles Strings as "custom coders" presently which for Go are always length prefixed (and reported to the Runner as LP+CustomCoder). It would be straight forward to add the correct handling for strings, as Go natively treats strings as UTF8.
>> >>>>>>>> >>>
>> >>>>>>>> >>>
>> >>>>>>>> >>> On Wed, Apr 3, 2019, 5:03 PM Heejong Lee <he...@google.com> wrote:
>> >>>>>>>> >>>>
>> >>>>>>>> >>>> Hi all,
>> >>>>>>>> >>>>
>> >>>>>>>> >>>> It looks like UTF-8 String Coder in Java and Python SDKs uses different encoding schemes. StringUtf8Coder in Java SDK puts the varint length of the input string before actual data bytes however StrUtf8Coder in Python SDK directly encodes the input string to bytes value. For the last few weeks, I've been testing and fixing cross-language IO transforms and this discrepancy is a major blocker for me. IMO, we should unify the encoding schemes of UTF8 strings across the different SDKs and make it a standard coder. Any thoughts?
>> >>>>>>>> >>>>
>> >>>>>>>> >>>> Thanks,

Re: [DISCUSS] change the encoding scheme of Python StrUtf8Coder

Posted by Heejong Lee <he...@google.com>.

Robert, does nested/unnested context work properly for Java? I can see that
the Context is fixed to NESTED[1] and the encode method with the Context
parameter is marked as deprecated[2].

[1]:
https://github.com/apache/beam/blob/0868e7544fd1e96db67ff5b9e70a67802c0f0c8e/sdks/java/core/src/main/java/org/apache/beam/sdk/coders/StringUtf8Coder.java#L68
[2]:
https://github.com/apache/beam/blob/0868e7544fd1e96db67ff5b9e70a67802c0f0c8e/sdks/java/core/src/main/java/org/apache/beam/sdk/coders/Coder.java#L132

On Thu, Apr 4, 2019 at 3:25 PM Robert Bradshaw <ro...@google.com> wrote:

> I don't know why there are two separate copies of
> standard_coders.yaml--originally there was just one (though it did
> live in the Python directory). I'm guessing a copy was made rather
> than just pointing both to the new location, but that completely
> defeats the point. I can't seem to access JIRA right now; could
> someone file an issue to resolve this?
>
> I also think the spec should be next to the definition of the URN,
> that's one of the reason the URNs were originally in a markdown file
> (to encourage good documentation, literate programming style). Many
> coders already have their specs there.
>
> Regarding backwards compatibility, we can't change existing coders,
> and making new coders won't help with inference ('cause changing that
> would also be backwards incompatible). Fortunately, I think we're
> already doing the consistent thing here: In both Python and Java the
> raw UTF-8 encoded bytes are encoded when used in an *unnested* context
> and the length-prefixed UTF-8 encoded bytes are used when the coder is
> used in a *nested* context.
>
> I'd really like to see the whole nested/unnested context go away, but
> that'll probably require Beam 3.0; it causes way more confusion than
> the couple of bytes it saves in a couple of places.
>
> - Robert
>
> On Thu, Apr 4, 2019 at 10:55 PM Robert Burke <ro...@frantil.com> wrote:
> >
> > My 2cents is that the "Textual description" should be part of the
> documentation of the URNs on the Proto messages, since that's the common
> place. I've added a short description for the varints for example, and we
> already have lenghthier format & protocol descriptions there for iterables
> and similar.
> >
> > The proto [1] *can be* the spec if we want it to be.
> >
> > [1]:
> https://github.com/apache/beam/blob/069fc3de95bd96f34c363308ad9ba988ab58502d/model/pipeline/src/main/proto/beam_runner_api.proto#L557
> >
> > On Thu, 4 Apr 2019 at 13:51, Kenneth Knowles <ke...@apache.org> wrote:
> >>
> >>
> >>
> >> On Thu, Apr 4, 2019 at 1:49 PM Robert Burke <ro...@frantil.com> wrote:
> >>>
> >>> We should probably move the "java" version of the yaml file [1] to a
> common location rather than deep in the java hierarchy, or copying it for
> Go and Python, but that can be a separate task. It's probably non-trivial
> since it looks like it's part of a java resources structure.
> >>
> >>
> >> Seems like /model is a good place for this if we don't want to invent a
> new language-independent hierarchy.
> >>
> >> Kenn
> >>
> >>
> >>>
> >>> Luke, the Go SDK doesn't currently do this validation, but it
> shouldn't be difficult, given pointers to the Java and Python variants of
> the tests to crib from [2]. Care would need to be taken so that Beam Go SDK
> users (such as they are) aren't forced to run them, and not have the yaml
> file to read. I'd suggest putting it with the integration tests [3].
> >>>
> >>> I've filed a JIRA (BEAM-7009) for tracking this Go SDK side work. [4]
> >>>
> >>> 1:
> https://github.com/apache/beam/blob/master/model/fn-execution/src/main/resources/org/apache/beam/model/fnexecution/v1/standard_coders.yaml
> >>> 2:
> https://github.com/apache/beam/search?q=standard_coders.yaml&unscoped_q=standard_coders.yaml
> >>> 3: https://github.com/apache/beam/tree/master/sdks/go/test
> >>> 4: https://issues.apache.org/jira/browse/BEAM-7009
> >>>
> >>>
> >>> On Thu, 4 Apr 2019 at 13:28, Lukasz Cwik <lc...@google.com> wrote:
> >>>>
> >>>>
> >>>>
> >>>> On Thu, Apr 4, 2019 at 1:15 PM Chamikara Jayalath <
> chamikara@google.com> wrote:
> >>>>>
> >>>>>
> >>>>>
> >>>>> On Thu, Apr 4, 2019 at 12:15 PM Lukasz Cwik <lc...@google.com>
> wrote:
> >>>>>>
> >>>>>> standard_coders.yaml[1] is where we are currently defining these
> formats.
> >>>>>> Unfortunately the Python SDK has its own copy[2].
> >>>>>
> >>>>>
> >>>>> Ah great. Thanks for the pointer. Any idea why there's  a separate
> copy for Python ? I didn't see a significant difference in definitions
> looking at few random coders there but I might have missed something. If
> there's no reason to maintain two, we should probably unify.
> >>>>> Also, seems like we haven't added the definition for UTF-8 coder yet.
> >>>>>
> >>>>
> >>>> Not certain as well. I did notice the timer coder definition didn't
> exist in the Python copy.
> >>>>
> >>>>>>
> >>>>>>
> >>>>>> Here is an example PR[3] that adds the "beam:coder:double:v1" as
> tests to the Java and Python SDKs to ensure interoperability.
> >>>>>>
> >>>>>> Robert Burke, does the Go SDK have a test where it uses
> standard_coders.yaml and runs compatibility tests?
> >>>>>>
> >>>>>> Chamikara, creating new coder classes is a pain since the type ->
> coder mapping per SDK language would select the non-well known type if we
> added a new one to a language. If we swapped the default type->coder
> mapping, this would still break update for pipelines forcing users to
> update their code to select the non-well known type. If we don't change the
> default type->coder mapping, the well known coder will gain little usage. I
> think we should fix the Python coder to use the same encoding as Java for
> UTF-8 strings before there are too many Python SDK users.
> >>>>>
> >>>>>
> >>>>> I was thinking that may be we should just change the default UTF-8
> coder for Fn API path which is experimental. Updating Python to do what's
> done for Java is fine if we agree that encoding used for Java should be the
> standard.
> >>>>>
> >>>>
> >>>> That is a good idea to use the Fn API experiment to control which
> gets selected.
> >>>>
> >>>>>>
> >>>>>>
> >>>>>> 1:
> https://github.com/apache/beam/blob/master/model/fn-execution/src/main/resources/org/apache/beam/model/fnexecution/v1/standard_coders.yaml
> >>>>>> 2:
> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/testing/data/standard_coders.yaml
> >>>>>> 3: https://github.com/apache/beam/pull/8205
> >>>>>>
> >>>>>> On Thu, Apr 4, 2019 at 11:50 AM Chamikara Jayalath <
> chamikara@google.com> wrote:
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> On Thu, Apr 4, 2019 at 11:29 AM Robert Bradshaw <
> robertwb@google.com> wrote:
> >>>>>>>>
> >>>>>>>> A URN defines the encoding.
> >>>>>>>>
> >>>>>>>> There are (unfortunately) *two* encodings defined for a Coder
> (defined
> >>>>>>>> by a URN), the nested and the unnested one. IIRC, in both Java and
> >>>>>>>> Python, the nested one prefixes with a var-int length, and the
> >>>>>>>> unnested one does not.
> >>>>>>>
> >>>>>>>
> >>>>>>> Could you clarify where we define the exact encoding ? I only see
> a URN for UTF-8 [1] while if you look at the implementations Java includes
> length in the encoding [1] while Python [1] does not.
> >>>>>>>
> >>>>>>> [1]
> https://github.com/apache/beam/blob/069fc3de95bd96f34c363308ad9ba988ab58502d/model/pipeline/src/main/proto/beam_runner_api.proto#L563
> >>>>>>> [2]
> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/coders/StringUtf8Coder.java#L50
> >>>>>>> [3]
> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/coders/coders.py#L321
> >>>>>>>
> >>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> We should define the spec clearly and have cross-language tests.
> >>>>>>>
> >>>>>>>
> >>>>>>> +1
> >>>>>>>
> >>>>>>> Regarding backwards compatibility, I agree that we should probably
> not update existing coder classes. Probably we should just standardize the
> correct encoding (may be as a comment near corresponding URN in the
> beam_runner_api.proto ?) and create new coder classes as needed.
> >>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On Thu, Apr 4, 2019 at 8:13 PM Pablo Estrada <pa...@google.com>
> wrote:
> >>>>>>>> >
> >>>>>>>> > Could this be a backwards-incompatible change that would break
> pipelines from upgrading? If they have data in-flight in between operators,
> and we change the coder, they would break?
> >>>>>>>> > I know very little about coders, but since nobody has mentioned
> it, I wanted to make sure we have it in mind.
> >>>>>>>> > -P.
> >>>>>>>> >
> >>>>>>>> > On Wed, Apr 3, 2019 at 8:33 PM Kenneth Knowles <ke...@apache.org>
> wrote:
> >>>>>>>> >>
> >>>>>>>> >> Agree that a coder URN defines the encoding. I see that string
> UTF-8 was added to the proto enum, but it needs a written spec of the
> encoding. Ideally some test data that different languages can use to drive
> compliance testing.
> >>>>>>>> >>
> >>>>>>>> >> Kenn
> >>>>>>>> >>
> >>>>>>>> >> On Wed, Apr 3, 2019 at 6:21 PM Robert Burke <
> robert@frantil.com> wrote:
> >>>>>>>> >>>
> >>>>>>>> >>> String UTF8 was recently added as a "standard coder " URN in
> the protos, but I don't think that developed beyond Java, so adding it to
> Python would be reasonable in my opinion.
> >>>>>>>> >>>
> >>>>>>>> >>> The Go SDK handles Strings as "custom coders" presently which
> for Go are always length prefixed (and reported to the Runner as
> LP+CustomCoder). It would be straight forward to add the correct handling
> for strings, as Go natively treats strings as UTF8.
> >>>>>>>> >>>
> >>>>>>>> >>>
> >>>>>>>> >>> On Wed, Apr 3, 2019, 5:03 PM Heejong Lee <he...@google.com>
> wrote:
> >>>>>>>> >>>>
> >>>>>>>> >>>> Hi all,
> >>>>>>>> >>>>
> >>>>>>>> >>>> It looks like UTF-8 String Coder in Java and Python SDKs
> uses different encoding schemes. StringUtf8Coder in Java SDK puts the
> varint length of the input string before actual data bytes however
> StrUtf8Coder in Python SDK directly encodes the input string to bytes
> value. For the last few weeks, I've been testing and fixing cross-language
> IO transforms and this discrepancy is a major blocker for me. IMO, we
> should unify the encoding schemes of UTF8 strings across the different SDKs
> and make it a standard coder. Any thoughts?
> >>>>>>>> >>>>
> >>>>>>>> >>>> Thanks,
>

Re: [DISCUSS] change the encoding scheme of Python StrUtf8Coder

Posted by Robert Bradshaw <ro...@google.com>.

I don't know why there are two separate copies of
standard_coders.yaml--originally there was just one (though it did
live in the Python directory). I'm guessing a copy was made rather
than just pointing both to the new location, but that completely
defeats the point. I can't seem to access JIRA right now; could
someone file an issue to resolve this?

I also think the spec should be next to the definition of the URN,
that's one of the reason the URNs were originally in a markdown file
(to encourage good documentation, literate programming style). Many
coders already have their specs there.

Regarding backwards compatibility, we can't change existing coders,
and making new coders won't help with inference ('cause changing that
would also be backwards incompatible). Fortunately, I think we're
already doing the consistent thing here: In both Python and Java the
raw UTF-8 encoded bytes are encoded when used in an *unnested* context
and the length-prefixed UTF-8 encoded bytes are used when the coder is
used in a *nested* context.

I'd really like to see the whole nested/unnested context go away, but
that'll probably require Beam 3.0; it causes way more confusion than
the couple of bytes it saves in a couple of places.

- Robert

On Thu, Apr 4, 2019 at 10:55 PM Robert Burke <ro...@frantil.com> wrote:
>
> My 2cents is that the "Textual description" should be part of the documentation of the URNs on the Proto messages, since that's the common place. I've added a short description for the varints for example, and we already have lenghthier format & protocol descriptions there for iterables and similar.
>
> The proto [1] *can be* the spec if we want it to be.
>
> [1]: https://github.com/apache/beam/blob/069fc3de95bd96f34c363308ad9ba988ab58502d/model/pipeline/src/main/proto/beam_runner_api.proto#L557
>
> On Thu, 4 Apr 2019 at 13:51, Kenneth Knowles <ke...@apache.org> wrote:
>>
>>
>>
>> On Thu, Apr 4, 2019 at 1:49 PM Robert Burke <ro...@frantil.com> wrote:
>>>
>>> We should probably move the "java" version of the yaml file [1] to a common location rather than deep in the java hierarchy, or copying it for Go and Python, but that can be a separate task. It's probably non-trivial since it looks like it's part of a java resources structure.
>>
>>
>> Seems like /model is a good place for this if we don't want to invent a new language-independent hierarchy.
>>
>> Kenn
>>
>>
>>>
>>> Luke, the Go SDK doesn't currently do this validation, but it shouldn't be difficult, given pointers to the Java and Python variants of the tests to crib from [2]. Care would need to be taken so that Beam Go SDK users (such as they are) aren't forced to run them, and not have the yaml file to read. I'd suggest putting it with the integration tests [3].
>>>
>>> I've filed a JIRA (BEAM-7009) for tracking this Go SDK side work. [4]
>>>
>>> 1: https://github.com/apache/beam/blob/master/model/fn-execution/src/main/resources/org/apache/beam/model/fnexecution/v1/standard_coders.yaml
>>> 2: https://github.com/apache/beam/search?q=standard_coders.yaml&unscoped_q=standard_coders.yaml
>>> 3: https://github.com/apache/beam/tree/master/sdks/go/test
>>> 4: https://issues.apache.org/jira/browse/BEAM-7009
>>>
>>>
>>> On Thu, 4 Apr 2019 at 13:28, Lukasz Cwik <lc...@google.com> wrote:
>>>>
>>>>
>>>>
>>>> On Thu, Apr 4, 2019 at 1:15 PM Chamikara Jayalath <ch...@google.com> wrote:
>>>>>
>>>>>
>>>>>
>>>>> On Thu, Apr 4, 2019 at 12:15 PM Lukasz Cwik <lc...@google.com> wrote:
>>>>>>
>>>>>> standard_coders.yaml[1] is where we are currently defining these formats.
>>>>>> Unfortunately the Python SDK has its own copy[2].
>>>>>
>>>>>
>>>>> Ah great. Thanks for the pointer. Any idea why there's  a separate copy for Python ? I didn't see a significant difference in definitions looking at few random coders there but I might have missed something. If there's no reason to maintain two, we should probably unify.
>>>>> Also, seems like we haven't added the definition for UTF-8 coder yet.
>>>>>
>>>>
>>>> Not certain as well. I did notice the timer coder definition didn't exist in the Python copy.
>>>>
>>>>>>
>>>>>>
>>>>>> Here is an example PR[3] that adds the "beam:coder:double:v1" as tests to the Java and Python SDKs to ensure interoperability.
>>>>>>
>>>>>> Robert Burke, does the Go SDK have a test where it uses standard_coders.yaml and runs compatibility tests?
>>>>>>
>>>>>> Chamikara, creating new coder classes is a pain since the type -> coder mapping per SDK language would select the non-well known type if we added a new one to a language. If we swapped the default type->coder mapping, this would still break update for pipelines forcing users to update their code to select the non-well known type. If we don't change the default type->coder mapping, the well known coder will gain little usage. I think we should fix the Python coder to use the same encoding as Java for UTF-8 strings before there are too many Python SDK users.
>>>>>
>>>>>
>>>>> I was thinking that may be we should just change the default UTF-8 coder for Fn API path which is experimental. Updating Python to do what's done for Java is fine if we agree that encoding used for Java should be the standard.
>>>>>
>>>>
>>>> That is a good idea to use the Fn API experiment to control which gets selected.
>>>>
>>>>>>
>>>>>>
>>>>>> 1: https://github.com/apache/beam/blob/master/model/fn-execution/src/main/resources/org/apache/beam/model/fnexecution/v1/standard_coders.yaml
>>>>>> 2: https://github.com/apache/beam/blob/master/sdks/python/apache_beam/testing/data/standard_coders.yaml
>>>>>> 3: https://github.com/apache/beam/pull/8205
>>>>>>
>>>>>> On Thu, Apr 4, 2019 at 11:50 AM Chamikara Jayalath <ch...@google.com> wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Apr 4, 2019 at 11:29 AM Robert Bradshaw <ro...@google.com> wrote:
>>>>>>>>
>>>>>>>> A URN defines the encoding.
>>>>>>>>
>>>>>>>> There are (unfortunately) *two* encodings defined for a Coder (defined
>>>>>>>> by a URN), the nested and the unnested one. IIRC, in both Java and
>>>>>>>> Python, the nested one prefixes with a var-int length, and the
>>>>>>>> unnested one does not.
>>>>>>>
>>>>>>>
>>>>>>> Could you clarify where we define the exact encoding ? I only see a URN for UTF-8 [1] while if you look at the implementations Java includes length in the encoding [1] while Python [1] does not.
>>>>>>>
>>>>>>> [1] https://github.com/apache/beam/blob/069fc3de95bd96f34c363308ad9ba988ab58502d/model/pipeline/src/main/proto/beam_runner_api.proto#L563
>>>>>>> [2] https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/coders/StringUtf8Coder.java#L50
>>>>>>> [3] https://github.com/apache/beam/blob/master/sdks/python/apache_beam/coders/coders.py#L321
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> We should define the spec clearly and have cross-language tests.
>>>>>>>
>>>>>>>
>>>>>>> +1
>>>>>>>
>>>>>>> Regarding backwards compatibility, I agree that we should probably not update existing coder classes. Probably we should just standardize the correct encoding (may be as a comment near corresponding URN in the beam_runner_api.proto ?) and create new coder classes as needed.
>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, Apr 4, 2019 at 8:13 PM Pablo Estrada <pa...@google.com> wrote:
>>>>>>>> >
>>>>>>>> > Could this be a backwards-incompatible change that would break pipelines from upgrading? If they have data in-flight in between operators, and we change the coder, they would break?
>>>>>>>> > I know very little about coders, but since nobody has mentioned it, I wanted to make sure we have it in mind.
>>>>>>>> > -P.
>>>>>>>> >
>>>>>>>> > On Wed, Apr 3, 2019 at 8:33 PM Kenneth Knowles <ke...@apache.org> wrote:
>>>>>>>> >>
>>>>>>>> >> Agree that a coder URN defines the encoding. I see that string UTF-8 was added to the proto enum, but it needs a written spec of the encoding. Ideally some test data that different languages can use to drive compliance testing.
>>>>>>>> >>
>>>>>>>> >> Kenn
>>>>>>>> >>
>>>>>>>> >> On Wed, Apr 3, 2019 at 6:21 PM Robert Burke <ro...@frantil.com> wrote:
>>>>>>>> >>>
>>>>>>>> >>> String UTF8 was recently added as a "standard coder " URN in the protos, but I don't think that developed beyond Java, so adding it to Python would be reasonable in my opinion.
>>>>>>>> >>>
>>>>>>>> >>> The Go SDK handles Strings as "custom coders" presently which for Go are always length prefixed (and reported to the Runner as LP+CustomCoder). It would be straight forward to add the correct handling for strings, as Go natively treats strings as UTF8.
>>>>>>>> >>>
>>>>>>>> >>>
>>>>>>>> >>> On Wed, Apr 3, 2019, 5:03 PM Heejong Lee <he...@google.com> wrote:
>>>>>>>> >>>>
>>>>>>>> >>>> Hi all,
>>>>>>>> >>>>
>>>>>>>> >>>> It looks like UTF-8 String Coder in Java and Python SDKs uses different encoding schemes. StringUtf8Coder in Java SDK puts the varint length of the input string before actual data bytes however StrUtf8Coder in Python SDK directly encodes the input string to bytes value. For the last few weeks, I've been testing and fixing cross-language IO transforms and this discrepancy is a major blocker for me. IMO, we should unify the encoding schemes of UTF8 strings across the different SDKs and make it a standard coder. Any thoughts?
>>>>>>>> >>>>
>>>>>>>> >>>> Thanks,

Re: [DISCUSS] change the encoding scheme of Python StrUtf8Coder

Posted by Robert Burke <ro...@frantil.com>.

My 2cents is that the "Textual description" should be part of the
documentation of the URNs on the Proto messages, since that's the common
place. I've added a short description for the varints for example, and we
already have lenghthier format & protocol descriptions there for iterables
and similar.

The proto [1] *can be* the spec if we want it to be.

[1]:
https://github.com/apache/beam/blob/069fc3de95bd96f34c363308ad9ba988ab58502d/model/pipeline/src/main/proto/beam_runner_api.proto#L557

On Thu, 4 Apr 2019 at 13:51, Kenneth Knowles <ke...@apache.org> wrote:

>
>
> On Thu, Apr 4, 2019 at 1:49 PM Robert Burke <ro...@frantil.com> wrote:
>
>> We should probably move the "java" version of the yaml file [1] to a
>> common location rather than deep in the java hierarchy, or copying it for
>> Go and Python, but that can be a separate task. It's probably non-trivial
>> since it looks like it's part of a java resources structure.
>>
>
> Seems like /model is a good place for this if we don't want to invent a
> new language-independent hierarchy.
>
> Kenn
>
>
>
>> Luke, the Go SDK doesn't currently do this validation, but it shouldn't
>> be difficult, given pointers to the Java and Python variants of the tests
>> to crib from [2]. Care would need to be taken so that Beam Go SDK users
>> (such as they are) aren't forced to run them, and not have the yaml file to
>> read. I'd suggest putting it with the integration tests [3].
>>
>> I've filed a JIRA (BEAM-7009) for tracking this Go SDK side work. [4]
>>
>> 1:
>> https://github.com/apache/beam/blob/master/model/fn-execution/src/main/resources/org/apache/beam/model/fnexecution/v1/standard_coders.yaml
>> 2:
>> https://github.com/apache/beam/search?q=standard_coders.yaml&unscoped_q=standard_coders.yaml
>> 3: https://github.com/apache/beam/tree/master/sdks/go/test
>> <https://github.com/apache/beam/tree/master/sdks/go/test>
>> 4: https://issues.apache.org/jira/browse/BEAM-7009
>>
>>
>> On Thu, 4 Apr 2019 at 13:28, Lukasz Cwik <lc...@google.com> wrote:
>>
>>>
>>>
>>> On Thu, Apr 4, 2019 at 1:15 PM Chamikara Jayalath <ch...@google.com>
>>> wrote:
>>>
>>>>
>>>>
>>>> On Thu, Apr 4, 2019 at 12:15 PM Lukasz Cwik <lc...@google.com> wrote:
>>>>
>>>>> standard_coders.yaml[1] is where we are currently defining these
>>>>> formats.
>>>>> Unfortunately the Python SDK has its own copy[2].
>>>>>
>>>>
>>>> Ah great. Thanks for the pointer. Any idea why there's  a separate copy
>>>> for Python ? I didn't see a significant difference in definitions looking
>>>> at few random coders there but I might have missed something. If there's no
>>>> reason to maintain two, we should probably unify.
>>>> Also, seems like we haven't added the definition for UTF-8 coder yet.
>>>>
>>>>
>>> Not certain as well. I did notice the timer coder definition didn't
>>> exist in the Python copy.
>>>
>>>
>>>>
>>>>> Here is an example PR[3] that adds the "beam:coder:double:v1" as tests
>>>>> to the Java and Python SDKs to ensure interoperability.
>>>>>
>>>>> Robert Burke, does the Go SDK have a test where it uses
>>>>> standard_coders.yaml and runs compatibility tests?
>>>>>
>>>>> Chamikara, creating new coder classes is a pain since the type ->
>>>>> coder mapping per SDK language would select the non-well known type if we
>>>>> added a new one to a language. If we swapped the default type->coder
>>>>> mapping, this would still break update for pipelines forcing users to
>>>>> update their code to select the non-well known type. If we don't change the
>>>>> default type->coder mapping, the well known coder will gain little usage. I
>>>>> think we should fix the Python coder to use the same encoding as Java for
>>>>> UTF-8 strings before there are too many Python SDK users.
>>>>>
>>>>
>>>> I was thinking that may be we should just change the default UTF-8
>>>> coder for Fn API path which is experimental. Updating Python to do what's
>>>> done for Java is fine if we agree that encoding used for Java should be the
>>>> standard.
>>>>
>>>>
>>> That is a good idea to use the Fn API experiment to control which gets
>>> selected.
>>>
>>>
>>>>
>>>>> 1:
>>>>> https://github.com/apache/beam/blob/master/model/fn-execution/src/main/resources/org/apache/beam/model/fnexecution/v1/standard_coders.yaml
>>>>> 2:
>>>>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/testing/data/standard_coders.yaml
>>>>> 3: https://github.com/apache/beam/pull/8205
>>>>>
>>>>> On Thu, Apr 4, 2019 at 11:50 AM Chamikara Jayalath <
>>>>> chamikara@google.com> wrote:
>>>>>
>>>>>>
>>>>>>
>>>>>> On Thu, Apr 4, 2019 at 11:29 AM Robert Bradshaw <ro...@google.com>
>>>>>> wrote:
>>>>>>
>>>>>>> A URN defines the encoding.
>>>>>>>
>>>>>>> There are (unfortunately) *two* encodings defined for a Coder
>>>>>>> (defined
>>>>>>> by a URN), the nested and the unnested one. IIRC, in both Java and
>>>>>>> Python, the nested one prefixes with a var-int length, and the
>>>>>>> unnested one does not.
>>>>>>>
>>>>>>
>>>>>> Could you clarify where we define the exact encoding ? I only see a
>>>>>> URN for UTF-8 [1] while if you look at the implementations Java includes
>>>>>> length in the encoding [1] while Python [1] does not.
>>>>>>
>>>>>> [1]
>>>>>> https://github.com/apache/beam/blob/069fc3de95bd96f34c363308ad9ba988ab58502d/model/pipeline/src/main/proto/beam_runner_api.proto#L563
>>>>>> [2]
>>>>>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/coders/StringUtf8Coder.java#L50
>>>>>> [3]
>>>>>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/coders/coders.py#L321
>>>>>>
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> We should define the spec clearly and have cross-language tests.
>>>>>>>
>>>>>>
>>>>>> +1
>>>>>>
>>>>>> Regarding backwards compatibility, I agree that we should probably
>>>>>> not update existing coder classes. Probably we should just standardize the
>>>>>> correct encoding (may be as a comment near corresponding URN in the
>>>>>> beam_runner_api.proto ?) and create new coder classes as needed.
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> On Thu, Apr 4, 2019 at 8:13 PM Pablo Estrada <pa...@google.com>
>>>>>>> wrote:
>>>>>>> >
>>>>>>> > Could this be a backwards-incompatible change that would break
>>>>>>> pipelines from upgrading? If they have data in-flight in between operators,
>>>>>>> and we change the coder, they would break?
>>>>>>> > I know very little about coders, but since nobody has mentioned
>>>>>>> it, I wanted to make sure we have it in mind.
>>>>>>> > -P.
>>>>>>> >
>>>>>>> > On Wed, Apr 3, 2019 at 8:33 PM Kenneth Knowles <ke...@apache.org>
>>>>>>> wrote:
>>>>>>> >>
>>>>>>> >> Agree that a coder URN defines the encoding. I see that string
>>>>>>> UTF-8 was added to the proto enum, but it needs a written spec of the
>>>>>>> encoding. Ideally some test data that different languages can use to drive
>>>>>>> compliance testing.
>>>>>>> >>
>>>>>>> >> Kenn
>>>>>>> >>
>>>>>>> >> On Wed, Apr 3, 2019 at 6:21 PM Robert Burke <ro...@frantil.com>
>>>>>>> wrote:
>>>>>>> >>>
>>>>>>> >>> String UTF8 was recently added as a "standard coder " URN in the
>>>>>>> protos, but I don't think that developed beyond Java, so adding it to
>>>>>>> Python would be reasonable in my opinion.
>>>>>>> >>>
>>>>>>> >>> The Go SDK handles Strings as "custom coders" presently which
>>>>>>> for Go are always length prefixed (and reported to the Runner as
>>>>>>> LP+CustomCoder). It would be straight forward to add the correct handling
>>>>>>> for strings, as Go natively treats strings as UTF8.
>>>>>>> >>>
>>>>>>> >>>
>>>>>>> >>> On Wed, Apr 3, 2019, 5:03 PM Heejong Lee <he...@google.com>
>>>>>>> wrote:
>>>>>>> >>>>
>>>>>>> >>>> Hi all,
>>>>>>> >>>>
>>>>>>> >>>> It looks like UTF-8 String Coder in Java and Python SDKs uses
>>>>>>> different encoding schemes. StringUtf8Coder in Java SDK puts the varint
>>>>>>> length of the input string before actual data bytes however StrUtf8Coder in
>>>>>>> Python SDK directly encodes the input string to bytes value. For the last
>>>>>>> few weeks, I've been testing and fixing cross-language IO transforms and
>>>>>>> this discrepancy is a major blocker for me. IMO, we should unify the
>>>>>>> encoding schemes of UTF8 strings across the different SDKs and make it a
>>>>>>> standard coder. Any thoughts?
>>>>>>> >>>>
>>>>>>> >>>> Thanks,
>>>>>>>
>>>>>>

Re: [DISCUSS] change the encoding scheme of Python StrUtf8Coder

Posted by Kenneth Knowles <ke...@apache.org>.

On Thu, Apr 4, 2019 at 1:49 PM Robert Burke <ro...@frantil.com> wrote:

> We should probably move the "java" version of the yaml file [1] to a
> common location rather than deep in the java hierarchy, or copying it for
> Go and Python, but that can be a separate task. It's probably non-trivial
> since it looks like it's part of a java resources structure.
>

Seems like /model is a good place for this if we don't want to invent a new
language-independent hierarchy.

Kenn



> Luke, the Go SDK doesn't currently do this validation, but it shouldn't be
> difficult, given pointers to the Java and Python variants of the tests to
> crib from [2]. Care would need to be taken so that Beam Go SDK users (such
> as they are) aren't forced to run them, and not have the yaml file to read.
> I'd suggest putting it with the integration tests [3].
>
> I've filed a JIRA (BEAM-7009) for tracking this Go SDK side work. [4]
>
> 1:
> https://github.com/apache/beam/blob/master/model/fn-execution/src/main/resources/org/apache/beam/model/fnexecution/v1/standard_coders.yaml
> 2:
> https://github.com/apache/beam/search?q=standard_coders.yaml&unscoped_q=standard_coders.yaml
> 3: https://github.com/apache/beam/tree/master/sdks/go/test
> <https://github.com/apache/beam/tree/master/sdks/go/test>
> 4: https://issues.apache.org/jira/browse/BEAM-7009
>
>
> On Thu, 4 Apr 2019 at 13:28, Lukasz Cwik <lc...@google.com> wrote:
>
>>
>>
>> On Thu, Apr 4, 2019 at 1:15 PM Chamikara Jayalath <ch...@google.com>
>> wrote:
>>
>>>
>>>
>>> On Thu, Apr 4, 2019 at 12:15 PM Lukasz Cwik <lc...@google.com> wrote:
>>>
>>>> standard_coders.yaml[1] is where we are currently defining these
>>>> formats.
>>>> Unfortunately the Python SDK has its own copy[2].
>>>>
>>>
>>> Ah great. Thanks for the pointer. Any idea why there's  a separate copy
>>> for Python ? I didn't see a significant difference in definitions looking
>>> at few random coders there but I might have missed something. If there's no
>>> reason to maintain two, we should probably unify.
>>> Also, seems like we haven't added the definition for UTF-8 coder yet.
>>>
>>>
>> Not certain as well. I did notice the timer coder definition didn't exist
>> in the Python copy.
>>
>>
>>>
>>>> Here is an example PR[3] that adds the "beam:coder:double:v1" as tests
>>>> to the Java and Python SDKs to ensure interoperability.
>>>>
>>>> Robert Burke, does the Go SDK have a test where it uses
>>>> standard_coders.yaml and runs compatibility tests?
>>>>
>>>> Chamikara, creating new coder classes is a pain since the type -> coder
>>>> mapping per SDK language would select the non-well known type if we added a
>>>> new one to a language. If we swapped the default type->coder mapping, this
>>>> would still break update for pipelines forcing users to update their code
>>>> to select the non-well known type. If we don't change the default
>>>> type->coder mapping, the well known coder will gain little usage. I think
>>>> we should fix the Python coder to use the same encoding as Java for UTF-8
>>>> strings before there are too many Python SDK users.
>>>>
>>>
>>> I was thinking that may be we should just change the default UTF-8 coder
>>> for Fn API path which is experimental. Updating Python to do what's done
>>> for Java is fine if we agree that encoding used for Java should be the
>>> standard.
>>>
>>>
>> That is a good idea to use the Fn API experiment to control which gets
>> selected.
>>
>>
>>>
>>>> 1:
>>>> https://github.com/apache/beam/blob/master/model/fn-execution/src/main/resources/org/apache/beam/model/fnexecution/v1/standard_coders.yaml
>>>> 2:
>>>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/testing/data/standard_coders.yaml
>>>> 3: https://github.com/apache/beam/pull/8205
>>>>
>>>> On Thu, Apr 4, 2019 at 11:50 AM Chamikara Jayalath <
>>>> chamikara@google.com> wrote:
>>>>
>>>>>
>>>>>
>>>>> On Thu, Apr 4, 2019 at 11:29 AM Robert Bradshaw <ro...@google.com>
>>>>> wrote:
>>>>>
>>>>>> A URN defines the encoding.
>>>>>>
>>>>>> There are (unfortunately) *two* encodings defined for a Coder (defined
>>>>>> by a URN), the nested and the unnested one. IIRC, in both Java and
>>>>>> Python, the nested one prefixes with a var-int length, and the
>>>>>> unnested one does not.
>>>>>>
>>>>>
>>>>> Could you clarify where we define the exact encoding ? I only see a
>>>>> URN for UTF-8 [1] while if you look at the implementations Java includes
>>>>> length in the encoding [1] while Python [1] does not.
>>>>>
>>>>> [1]
>>>>> https://github.com/apache/beam/blob/069fc3de95bd96f34c363308ad9ba988ab58502d/model/pipeline/src/main/proto/beam_runner_api.proto#L563
>>>>> [2]
>>>>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/coders/StringUtf8Coder.java#L50
>>>>> [3]
>>>>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/coders/coders.py#L321
>>>>>
>>>>>
>>>>>
>>>>>>
>>>>>> We should define the spec clearly and have cross-language tests.
>>>>>>
>>>>>
>>>>> +1
>>>>>
>>>>> Regarding backwards compatibility, I agree that we should probably not
>>>>> update existing coder classes. Probably we should just standardize the
>>>>> correct encoding (may be as a comment near corresponding URN in the
>>>>> beam_runner_api.proto ?) and create new coder classes as needed.
>>>>>
>>>>>
>>>>>>
>>>>>> On Thu, Apr 4, 2019 at 8:13 PM Pablo Estrada <pa...@google.com>
>>>>>> wrote:
>>>>>> >
>>>>>> > Could this be a backwards-incompatible change that would break
>>>>>> pipelines from upgrading? If they have data in-flight in between operators,
>>>>>> and we change the coder, they would break?
>>>>>> > I know very little about coders, but since nobody has mentioned it,
>>>>>> I wanted to make sure we have it in mind.
>>>>>> > -P.
>>>>>> >
>>>>>> > On Wed, Apr 3, 2019 at 8:33 PM Kenneth Knowles <ke...@apache.org>
>>>>>> wrote:
>>>>>> >>
>>>>>> >> Agree that a coder URN defines the encoding. I see that string
>>>>>> UTF-8 was added to the proto enum, but it needs a written spec of the
>>>>>> encoding. Ideally some test data that different languages can use to drive
>>>>>> compliance testing.
>>>>>> >>
>>>>>> >> Kenn
>>>>>> >>
>>>>>> >> On Wed, Apr 3, 2019 at 6:21 PM Robert Burke <ro...@frantil.com>
>>>>>> wrote:
>>>>>> >>>
>>>>>> >>> String UTF8 was recently added as a "standard coder " URN in the
>>>>>> protos, but I don't think that developed beyond Java, so adding it to
>>>>>> Python would be reasonable in my opinion.
>>>>>> >>>
>>>>>> >>> The Go SDK handles Strings as "custom coders" presently which for
>>>>>> Go are always length prefixed (and reported to the Runner as
>>>>>> LP+CustomCoder). It would be straight forward to add the correct handling
>>>>>> for strings, as Go natively treats strings as UTF8.
>>>>>> >>>
>>>>>> >>>
>>>>>> >>> On Wed, Apr 3, 2019, 5:03 PM Heejong Lee <he...@google.com>
>>>>>> wrote:
>>>>>> >>>>
>>>>>> >>>> Hi all,
>>>>>> >>>>
>>>>>> >>>> It looks like UTF-8 String Coder in Java and Python SDKs uses
>>>>>> different encoding schemes. StringUtf8Coder in Java SDK puts the varint
>>>>>> length of the input string before actual data bytes however StrUtf8Coder in
>>>>>> Python SDK directly encodes the input string to bytes value. For the last
>>>>>> few weeks, I've been testing and fixing cross-language IO transforms and
>>>>>> this discrepancy is a major blocker for me. IMO, we should unify the
>>>>>> encoding schemes of UTF8 strings across the different SDKs and make it a
>>>>>> standard coder. Any thoughts?
>>>>>> >>>>
>>>>>> >>>> Thanks,
>>>>>>
>>>>>

Re: [DISCUSS] change the encoding scheme of Python StrUtf8Coder

Posted by Robert Burke <ro...@frantil.com>.

We should probably move the "java" version of the yaml file [1] to a common
location rather than deep in the java hierarchy, or copying it for Go and
Python, but that can be a separate task. It's probably non-trivial since it
looks like it's part of a java resources structure.

Luke, the Go SDK doesn't currently do this validation, but it shouldn't be
difficult, given pointers to the Java and Python variants of the tests to
crib from [2]. Care would need to be taken so that Beam Go SDK users (such
as they are) aren't forced to run them, and not have the yaml file to read.
I'd suggest putting it with the integration tests [3].

I've filed a JIRA (BEAM-7009) for tracking this Go SDK side work. [4]

1:
https://github.com/apache/beam/blob/master/model/fn-execution/src/main/resources/org/apache/beam/model/fnexecution/v1/standard_coders.yaml
2:
https://github.com/apache/beam/search?q=standard_coders.yaml&unscoped_q=standard_coders.yaml
3: https://github.com/apache/beam/tree/master/sdks/go/test
<https://github.com/apache/beam/tree/master/sdks/go/test>
4: https://issues.apache.org/jira/browse/BEAM-7009


On Thu, 4 Apr 2019 at 13:28, Lukasz Cwik <lc...@google.com> wrote:

>
>
> On Thu, Apr 4, 2019 at 1:15 PM Chamikara Jayalath <ch...@google.com>
> wrote:
>
>>
>>
>> On Thu, Apr 4, 2019 at 12:15 PM Lukasz Cwik <lc...@google.com> wrote:
>>
>>> standard_coders.yaml[1] is where we are currently defining these formats.
>>> Unfortunately the Python SDK has its own copy[2].
>>>
>>
>> Ah great. Thanks for the pointer. Any idea why there's  a separate copy
>> for Python ? I didn't see a significant difference in definitions looking
>> at few random coders there but I might have missed something. If there's no
>> reason to maintain two, we should probably unify.
>> Also, seems like we haven't added the definition for UTF-8 coder yet.
>>
>>
> Not certain as well. I did notice the timer coder definition didn't exist
> in the Python copy.
>
>
>>
>>> Here is an example PR[3] that adds the "beam:coder:double:v1" as tests
>>> to the Java and Python SDKs to ensure interoperability.
>>>
>>> Robert Burke, does the Go SDK have a test where it uses
>>> standard_coders.yaml and runs compatibility tests?
>>>
>>> Chamikara, creating new coder classes is a pain since the type -> coder
>>> mapping per SDK language would select the non-well known type if we added a
>>> new one to a language. If we swapped the default type->coder mapping, this
>>> would still break update for pipelines forcing users to update their code
>>> to select the non-well known type. If we don't change the default
>>> type->coder mapping, the well known coder will gain little usage. I think
>>> we should fix the Python coder to use the same encoding as Java for UTF-8
>>> strings before there are too many Python SDK users.
>>>
>>
>> I was thinking that may be we should just change the default UTF-8 coder
>> for Fn API path which is experimental. Updating Python to do what's done
>> for Java is fine if we agree that encoding used for Java should be the
>> standard.
>>
>>
> That is a good idea to use the Fn API experiment to control which gets
> selected.
>
>
>>
>>> 1:
>>> https://github.com/apache/beam/blob/master/model/fn-execution/src/main/resources/org/apache/beam/model/fnexecution/v1/standard_coders.yaml
>>> 2:
>>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/testing/data/standard_coders.yaml
>>> 3: https://github.com/apache/beam/pull/8205
>>>
>>> On Thu, Apr 4, 2019 at 11:50 AM Chamikara Jayalath <ch...@google.com>
>>> wrote:
>>>
>>>>
>>>>
>>>> On Thu, Apr 4, 2019 at 11:29 AM Robert Bradshaw <ro...@google.com>
>>>> wrote:
>>>>
>>>>> A URN defines the encoding.
>>>>>
>>>>> There are (unfortunately) *two* encodings defined for a Coder (defined
>>>>> by a URN), the nested and the unnested one. IIRC, in both Java and
>>>>> Python, the nested one prefixes with a var-int length, and the
>>>>> unnested one does not.
>>>>>
>>>>
>>>> Could you clarify where we define the exact encoding ? I only see a URN
>>>> for UTF-8 [1] while if you look at the implementations Java includes length
>>>> in the encoding [1] while Python [1] does not.
>>>>
>>>> [1]
>>>> https://github.com/apache/beam/blob/069fc3de95bd96f34c363308ad9ba988ab58502d/model/pipeline/src/main/proto/beam_runner_api.proto#L563
>>>> [2]
>>>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/coders/StringUtf8Coder.java#L50
>>>> [3]
>>>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/coders/coders.py#L321
>>>>
>>>>
>>>>
>>>>>
>>>>> We should define the spec clearly and have cross-language tests.
>>>>>
>>>>
>>>> +1
>>>>
>>>> Regarding backwards compatibility, I agree that we should probably not
>>>> update existing coder classes. Probably we should just standardize the
>>>> correct encoding (may be as a comment near corresponding URN in the
>>>> beam_runner_api.proto ?) and create new coder classes as needed.
>>>>
>>>>
>>>>>
>>>>> On Thu, Apr 4, 2019 at 8:13 PM Pablo Estrada <pa...@google.com>
>>>>> wrote:
>>>>> >
>>>>> > Could this be a backwards-incompatible change that would break
>>>>> pipelines from upgrading? If they have data in-flight in between operators,
>>>>> and we change the coder, they would break?
>>>>> > I know very little about coders, but since nobody has mentioned it,
>>>>> I wanted to make sure we have it in mind.
>>>>> > -P.
>>>>> >
>>>>> > On Wed, Apr 3, 2019 at 8:33 PM Kenneth Knowles <ke...@apache.org>
>>>>> wrote:
>>>>> >>
>>>>> >> Agree that a coder URN defines the encoding. I see that string
>>>>> UTF-8 was added to the proto enum, but it needs a written spec of the
>>>>> encoding. Ideally some test data that different languages can use to drive
>>>>> compliance testing.
>>>>> >>
>>>>> >> Kenn
>>>>> >>
>>>>> >> On Wed, Apr 3, 2019 at 6:21 PM Robert Burke <ro...@frantil.com>
>>>>> wrote:
>>>>> >>>
>>>>> >>> String UTF8 was recently added as a "standard coder " URN in the
>>>>> protos, but I don't think that developed beyond Java, so adding it to
>>>>> Python would be reasonable in my opinion.
>>>>> >>>
>>>>> >>> The Go SDK handles Strings as "custom coders" presently which for
>>>>> Go are always length prefixed (and reported to the Runner as
>>>>> LP+CustomCoder). It would be straight forward to add the correct handling
>>>>> for strings, as Go natively treats strings as UTF8.
>>>>> >>>
>>>>> >>>
>>>>> >>> On Wed, Apr 3, 2019, 5:03 PM Heejong Lee <he...@google.com>
>>>>> wrote:
>>>>> >>>>
>>>>> >>>> Hi all,
>>>>> >>>>
>>>>> >>>> It looks like UTF-8 String Coder in Java and Python SDKs uses
>>>>> different encoding schemes. StringUtf8Coder in Java SDK puts the varint
>>>>> length of the input string before actual data bytes however StrUtf8Coder in
>>>>> Python SDK directly encodes the input string to bytes value. For the last
>>>>> few weeks, I've been testing and fixing cross-language IO transforms and
>>>>> this discrepancy is a major blocker for me. IMO, we should unify the
>>>>> encoding schemes of UTF8 strings across the different SDKs and make it a
>>>>> standard coder. Any thoughts?
>>>>> >>>>
>>>>> >>>> Thanks,
>>>>>
>>>>

Re: [DISCUSS] change the encoding scheme of Python StrUtf8Coder

Posted by Lukasz Cwik <lc...@google.com>.

On Thu, Apr 4, 2019 at 1:15 PM Chamikara Jayalath <ch...@google.com>
wrote:

>
>
> On Thu, Apr 4, 2019 at 12:15 PM Lukasz Cwik <lc...@google.com> wrote:
>
>> standard_coders.yaml[1] is where we are currently defining these formats.
>> Unfortunately the Python SDK has its own copy[2].
>>
>
> Ah great. Thanks for the pointer. Any idea why there's  a separate copy
> for Python ? I didn't see a significant difference in definitions looking
> at few random coders there but I might have missed something. If there's no
> reason to maintain two, we should probably unify.
> Also, seems like we haven't added the definition for UTF-8 coder yet.
>
>
Not certain as well. I did notice the timer coder definition didn't exist
in the Python copy.


>
>> Here is an example PR[3] that adds the "beam:coder:double:v1" as tests to
>> the Java and Python SDKs to ensure interoperability.
>>
>> Robert Burke, does the Go SDK have a test where it uses
>> standard_coders.yaml and runs compatibility tests?
>>
>> Chamikara, creating new coder classes is a pain since the type -> coder
>> mapping per SDK language would select the non-well known type if we added a
>> new one to a language. If we swapped the default type->coder mapping, this
>> would still break update for pipelines forcing users to update their code
>> to select the non-well known type. If we don't change the default
>> type->coder mapping, the well known coder will gain little usage. I think
>> we should fix the Python coder to use the same encoding as Java for UTF-8
>> strings before there are too many Python SDK users.
>>
>
> I was thinking that may be we should just change the default UTF-8 coder
> for Fn API path which is experimental. Updating Python to do what's done
> for Java is fine if we agree that encoding used for Java should be the
> standard.
>
>
That is a good idea to use the Fn API experiment to control which gets
selected.


>
>> 1:
>> https://github.com/apache/beam/blob/master/model/fn-execution/src/main/resources/org/apache/beam/model/fnexecution/v1/standard_coders.yaml
>> 2:
>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/testing/data/standard_coders.yaml
>> 3: https://github.com/apache/beam/pull/8205
>>
>> On Thu, Apr 4, 2019 at 11:50 AM Chamikara Jayalath <ch...@google.com>
>> wrote:
>>
>>>
>>>
>>> On Thu, Apr 4, 2019 at 11:29 AM Robert Bradshaw <ro...@google.com>
>>> wrote:
>>>
>>>> A URN defines the encoding.
>>>>
>>>> There are (unfortunately) *two* encodings defined for a Coder (defined
>>>> by a URN), the nested and the unnested one. IIRC, in both Java and
>>>> Python, the nested one prefixes with a var-int length, and the
>>>> unnested one does not.
>>>>
>>>
>>> Could you clarify where we define the exact encoding ? I only see a URN
>>> for UTF-8 [1] while if you look at the implementations Java includes length
>>> in the encoding [1] while Python [1] does not.
>>>
>>> [1]
>>> https://github.com/apache/beam/blob/069fc3de95bd96f34c363308ad9ba988ab58502d/model/pipeline/src/main/proto/beam_runner_api.proto#L563
>>> [2]
>>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/coders/StringUtf8Coder.java#L50
>>> [3]
>>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/coders/coders.py#L321
>>>
>>>
>>>
>>>>
>>>> We should define the spec clearly and have cross-language tests.
>>>>
>>>
>>> +1
>>>
>>> Regarding backwards compatibility, I agree that we should probably not
>>> update existing coder classes. Probably we should just standardize the
>>> correct encoding (may be as a comment near corresponding URN in the
>>> beam_runner_api.proto ?) and create new coder classes as needed.
>>>
>>>
>>>>
>>>> On Thu, Apr 4, 2019 at 8:13 PM Pablo Estrada <pa...@google.com>
>>>> wrote:
>>>> >
>>>> > Could this be a backwards-incompatible change that would break
>>>> pipelines from upgrading? If they have data in-flight in between operators,
>>>> and we change the coder, they would break?
>>>> > I know very little about coders, but since nobody has mentioned it, I
>>>> wanted to make sure we have it in mind.
>>>> > -P.
>>>> >
>>>> > On Wed, Apr 3, 2019 at 8:33 PM Kenneth Knowles <ke...@apache.org>
>>>> wrote:
>>>> >>
>>>> >> Agree that a coder URN defines the encoding. I see that string UTF-8
>>>> was added to the proto enum, but it needs a written spec of the encoding.
>>>> Ideally some test data that different languages can use to drive compliance
>>>> testing.
>>>> >>
>>>> >> Kenn
>>>> >>
>>>> >> On Wed, Apr 3, 2019 at 6:21 PM Robert Burke <ro...@frantil.com>
>>>> wrote:
>>>> >>>
>>>> >>> String UTF8 was recently added as a "standard coder " URN in the
>>>> protos, but I don't think that developed beyond Java, so adding it to
>>>> Python would be reasonable in my opinion.
>>>> >>>
>>>> >>> The Go SDK handles Strings as "custom coders" presently which for
>>>> Go are always length prefixed (and reported to the Runner as
>>>> LP+CustomCoder). It would be straight forward to add the correct handling
>>>> for strings, as Go natively treats strings as UTF8.
>>>> >>>
>>>> >>>
>>>> >>> On Wed, Apr 3, 2019, 5:03 PM Heejong Lee <he...@google.com>
>>>> wrote:
>>>> >>>>
>>>> >>>> Hi all,
>>>> >>>>
>>>> >>>> It looks like UTF-8 String Coder in Java and Python SDKs uses
>>>> different encoding schemes. StringUtf8Coder in Java SDK puts the varint
>>>> length of the input string before actual data bytes however StrUtf8Coder in
>>>> Python SDK directly encodes the input string to bytes value. For the last
>>>> few weeks, I've been testing and fixing cross-language IO transforms and
>>>> this discrepancy is a major blocker for me. IMO, we should unify the
>>>> encoding schemes of UTF8 strings across the different SDKs and make it a
>>>> standard coder. Any thoughts?
>>>> >>>>
>>>> >>>> Thanks,
>>>>
>>>

Re: [DISCUSS] change the encoding scheme of Python StrUtf8Coder

Posted by Chamikara Jayalath <ch...@google.com>.

On Thu, Apr 4, 2019 at 12:15 PM Lukasz Cwik <lc...@google.com> wrote:

> standard_coders.yaml[1] is where we are currently defining these formats.
> Unfortunately the Python SDK has its own copy[2].
>

Ah great. Thanks for the pointer. Any idea why there's  a separate copy for
Python ? I didn't see a significant difference in definitions looking at
few random coders there but I might have missed something. If there's no
reason to maintain two, we should probably unify.
Also, seems like we haven't added the definition for UTF-8 coder yet.


>
> Here is an example PR[3] that adds the "beam:coder:double:v1" as tests to
> the Java and Python SDKs to ensure interoperability.
>
> Robert Burke, does the Go SDK have a test where it uses
> standard_coders.yaml and runs compatibility tests?
>
> Chamikara, creating new coder classes is a pain since the type -> coder
> mapping per SDK language would select the non-well known type if we added a
> new one to a language. If we swapped the default type->coder mapping, this
> would still break update for pipelines forcing users to update their code
> to select the non-well known type. If we don't change the default
> type->coder mapping, the well known coder will gain little usage. I think
> we should fix the Python coder to use the same encoding as Java for UTF-8
> strings before there are too many Python SDK users.
>

I was thinking that may be we should just change the default UTF-8 coder
for Fn API path which is experimental. Updating Python to do what's done
for Java is fine if we agree that encoding used for Java should be the
standard.


>
> 1:
> https://github.com/apache/beam/blob/master/model/fn-execution/src/main/resources/org/apache/beam/model/fnexecution/v1/standard_coders.yaml
> 2:
> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/testing/data/standard_coders.yaml
> 3: https://github.com/apache/beam/pull/8205
>
> On Thu, Apr 4, 2019 at 11:50 AM Chamikara Jayalath <ch...@google.com>
> wrote:
>
>>
>>
>> On Thu, Apr 4, 2019 at 11:29 AM Robert Bradshaw <ro...@google.com>
>> wrote:
>>
>>> A URN defines the encoding.
>>>
>>> There are (unfortunately) *two* encodings defined for a Coder (defined
>>> by a URN), the nested and the unnested one. IIRC, in both Java and
>>> Python, the nested one prefixes with a var-int length, and the
>>> unnested one does not.
>>>
>>
>> Could you clarify where we define the exact encoding ? I only see a URN
>> for UTF-8 [1] while if you look at the implementations Java includes length
>> in the encoding [1] while Python [1] does not.
>>
>> [1]
>> https://github.com/apache/beam/blob/069fc3de95bd96f34c363308ad9ba988ab58502d/model/pipeline/src/main/proto/beam_runner_api.proto#L563
>> [2]
>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/coders/StringUtf8Coder.java#L50
>> [3]
>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/coders/coders.py#L321
>>
>>
>>
>>>
>>> We should define the spec clearly and have cross-language tests.
>>>
>>
>> +1
>>
>> Regarding backwards compatibility, I agree that we should probably not
>> update existing coder classes. Probably we should just standardize the
>> correct encoding (may be as a comment near corresponding URN in the
>> beam_runner_api.proto ?) and create new coder classes as needed.
>>
>>
>>>
>>> On Thu, Apr 4, 2019 at 8:13 PM Pablo Estrada <pa...@google.com> wrote:
>>> >
>>> > Could this be a backwards-incompatible change that would break
>>> pipelines from upgrading? If they have data in-flight in between operators,
>>> and we change the coder, they would break?
>>> > I know very little about coders, but since nobody has mentioned it, I
>>> wanted to make sure we have it in mind.
>>> > -P.
>>> >
>>> > On Wed, Apr 3, 2019 at 8:33 PM Kenneth Knowles <ke...@apache.org>
>>> wrote:
>>> >>
>>> >> Agree that a coder URN defines the encoding. I see that string UTF-8
>>> was added to the proto enum, but it needs a written spec of the encoding.
>>> Ideally some test data that different languages can use to drive compliance
>>> testing.
>>> >>
>>> >> Kenn
>>> >>
>>> >> On Wed, Apr 3, 2019 at 6:21 PM Robert Burke <ro...@frantil.com>
>>> wrote:
>>> >>>
>>> >>> String UTF8 was recently added as a "standard coder " URN in the
>>> protos, but I don't think that developed beyond Java, so adding it to
>>> Python would be reasonable in my opinion.
>>> >>>
>>> >>> The Go SDK handles Strings as "custom coders" presently which for Go
>>> are always length prefixed (and reported to the Runner as LP+CustomCoder).
>>> It would be straight forward to add the correct handling for strings, as Go
>>> natively treats strings as UTF8.
>>> >>>
>>> >>>
>>> >>> On Wed, Apr 3, 2019, 5:03 PM Heejong Lee <he...@google.com> wrote:
>>> >>>>
>>> >>>> Hi all,
>>> >>>>
>>> >>>> It looks like UTF-8 String Coder in Java and Python SDKs uses
>>> different encoding schemes. StringUtf8Coder in Java SDK puts the varint
>>> length of the input string before actual data bytes however StrUtf8Coder in
>>> Python SDK directly encodes the input string to bytes value. For the last
>>> few weeks, I've been testing and fixing cross-language IO transforms and
>>> this discrepancy is a major blocker for me. IMO, we should unify the
>>> encoding schemes of UTF8 strings across the different SDKs and make it a
>>> standard coder. Any thoughts?
>>> >>>>
>>> >>>> Thanks,
>>>
>>

Re: [DISCUSS] change the encoding scheme of Python StrUtf8Coder

Posted by Lukasz Cwik <lc...@google.com>.

standard_coders.yaml[1] is where we are currently defining these formats.
Unfortunately the Python SDK has its own copy[2].

Here is an example PR[3] that adds the "beam:coder:double:v1" as tests to
the Java and Python SDKs to ensure interoperability.

Robert Burke, does the Go SDK have a test where it uses
standard_coders.yaml and runs compatibility tests?

Chamikara, creating new coder classes is a pain since the type -> coder
mapping per SDK language would select the non-well known type if we added a
new one to a language. If we swapped the default type->coder mapping, this
would still break update for pipelines forcing users to update their code
to select the non-well known type. If we don't change the default
type->coder mapping, the well known coder will gain little usage. I think
we should fix the Python coder to use the same encoding as Java for UTF-8
strings before there are too many Python SDK users.

1:
https://github.com/apache/beam/blob/master/model/fn-execution/src/main/resources/org/apache/beam/model/fnexecution/v1/standard_coders.yaml
2:
https://github.com/apache/beam/blob/master/sdks/python/apache_beam/testing/data/standard_coders.yaml
3: https://github.com/apache/beam/pull/8205

On Thu, Apr 4, 2019 at 11:50 AM Chamikara Jayalath <ch...@google.com>
wrote:

>
>
> On Thu, Apr 4, 2019 at 11:29 AM Robert Bradshaw <ro...@google.com>
> wrote:
>
>> A URN defines the encoding.
>>
>> There are (unfortunately) *two* encodings defined for a Coder (defined
>> by a URN), the nested and the unnested one. IIRC, in both Java and
>> Python, the nested one prefixes with a var-int length, and the
>> unnested one does not.
>>
>
> Could you clarify where we define the exact encoding ? I only see a URN
> for UTF-8 [1] while if you look at the implementations Java includes length
> in the encoding [1] while Python [1] does not.
>
> [1]
> https://github.com/apache/beam/blob/069fc3de95bd96f34c363308ad9ba988ab58502d/model/pipeline/src/main/proto/beam_runner_api.proto#L563
> [2]
> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/coders/StringUtf8Coder.java#L50
> [3]
> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/coders/coders.py#L321
>
>
>
>>
>> We should define the spec clearly and have cross-language tests.
>>
>
> +1
>
> Regarding backwards compatibility, I agree that we should probably not
> update existing coder classes. Probably we should just standardize the
> correct encoding (may be as a comment near corresponding URN in the
> beam_runner_api.proto ?) and create new coder classes as needed.
>
>
>>
>> On Thu, Apr 4, 2019 at 8:13 PM Pablo Estrada <pa...@google.com> wrote:
>> >
>> > Could this be a backwards-incompatible change that would break
>> pipelines from upgrading? If they have data in-flight in between operators,
>> and we change the coder, they would break?
>> > I know very little about coders, but since nobody has mentioned it, I
>> wanted to make sure we have it in mind.
>> > -P.
>> >
>> > On Wed, Apr 3, 2019 at 8:33 PM Kenneth Knowles <ke...@apache.org> wrote:
>> >>
>> >> Agree that a coder URN defines the encoding. I see that string UTF-8
>> was added to the proto enum, but it needs a written spec of the encoding.
>> Ideally some test data that different languages can use to drive compliance
>> testing.
>> >>
>> >> Kenn
>> >>
>> >> On Wed, Apr 3, 2019 at 6:21 PM Robert Burke <ro...@frantil.com>
>> wrote:
>> >>>
>> >>> String UTF8 was recently added as a "standard coder " URN in the
>> protos, but I don't think that developed beyond Java, so adding it to
>> Python would be reasonable in my opinion.
>> >>>
>> >>> The Go SDK handles Strings as "custom coders" presently which for Go
>> are always length prefixed (and reported to the Runner as LP+CustomCoder).
>> It would be straight forward to add the correct handling for strings, as Go
>> natively treats strings as UTF8.
>> >>>
>> >>>
>> >>> On Wed, Apr 3, 2019, 5:03 PM Heejong Lee <he...@google.com> wrote:
>> >>>>
>> >>>> Hi all,
>> >>>>
>> >>>> It looks like UTF-8 String Coder in Java and Python SDKs uses
>> different encoding schemes. StringUtf8Coder in Java SDK puts the varint
>> length of the input string before actual data bytes however StrUtf8Coder in
>> Python SDK directly encodes the input string to bytes value. For the last
>> few weeks, I've been testing and fixing cross-language IO transforms and
>> this discrepancy is a major blocker for me. IMO, we should unify the
>> encoding schemes of UTF8 strings across the different SDKs and make it a
>> standard coder. Any thoughts?
>> >>>>
>> >>>> Thanks,
>>
>

Re: [DISCUSS] change the encoding scheme of Python StrUtf8Coder

Posted by Heejong Lee <he...@google.com>.

On Thu, Apr 4, 2019 at 11:50 AM Chamikara Jayalath <ch...@google.com>
wrote:

>
>
> On Thu, Apr 4, 2019 at 11:29 AM Robert Bradshaw <ro...@google.com>
> wrote:
>
>> A URN defines the encoding.
>>
>> There are (unfortunately) *two* encodings defined for a Coder (defined
>> by a URN), the nested and the unnested one. IIRC, in both Java and
>> Python, the nested one prefixes with a var-int length, and the
>> unnested one does not.
>>
>
> Could you clarify where we define the exact encoding ? I only see a URN
> for UTF-8 [1] while if you look at the implementations Java includes length
> in the encoding [1] while Python [1] does not.
>
> [1]
> https://github.com/apache/beam/blob/069fc3de95bd96f34c363308ad9ba988ab58502d/model/pipeline/src/main/proto/beam_runner_api.proto#L563
> [2]
> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/coders/StringUtf8Coder.java#L50
> [3]
> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/coders/coders.py#L321
>
>
>
>>
>> We should define the spec clearly and have cross-language tests.
>>
>
> +1
>
> Regarding backwards compatibility, I agree that we should probably not
> update existing coder classes. Probably we should just standardize the
> correct encoding (may be as a comment near corresponding URN in the
> beam_runner_api.proto ?) and create new coder classes as needed.
>

Then how do we pair the type and the coder? For Java we can explicitly
assign a specific coder to PCollection but for python a coder is inferred
from an element type of PCollection. If we create another standard coder
for utf-8 string, would that new coder be a default for the string element
type?


>
>
>>
>> On Thu, Apr 4, 2019 at 8:13 PM Pablo Estrada <pa...@google.com> wrote:
>> >
>> > Could this be a backwards-incompatible change that would break
>> pipelines from upgrading? If they have data in-flight in between operators,
>> and we change the coder, they would break?
>> > I know very little about coders, but since nobody has mentioned it, I
>> wanted to make sure we have it in mind.
>> > -P.
>> >
>> > On Wed, Apr 3, 2019 at 8:33 PM Kenneth Knowles <ke...@apache.org> wrote:
>> >>
>> >> Agree that a coder URN defines the encoding. I see that string UTF-8
>> was added to the proto enum, but it needs a written spec of the encoding.
>> Ideally some test data that different languages can use to drive compliance
>> testing.
>> >>
>> >> Kenn
>> >>
>> >> On Wed, Apr 3, 2019 at 6:21 PM Robert Burke <ro...@frantil.com>
>> wrote:
>> >>>
>> >>> String UTF8 was recently added as a "standard coder " URN in the
>> protos, but I don't think that developed beyond Java, so adding it to
>> Python would be reasonable in my opinion.
>> >>>
>> >>> The Go SDK handles Strings as "custom coders" presently which for Go
>> are always length prefixed (and reported to the Runner as LP+CustomCoder).
>> It would be straight forward to add the correct handling for strings, as Go
>> natively treats strings as UTF8.
>> >>>
>> >>>
>> >>> On Wed, Apr 3, 2019, 5:03 PM Heejong Lee <he...@google.com> wrote:
>> >>>>
>> >>>> Hi all,
>> >>>>
>> >>>> It looks like UTF-8 String Coder in Java and Python SDKs uses
>> different encoding schemes. StringUtf8Coder in Java SDK puts the varint
>> length of the input string before actual data bytes however StrUtf8Coder in
>> Python SDK directly encodes the input string to bytes value. For the last
>> few weeks, I've been testing and fixing cross-language IO transforms and
>> this discrepancy is a major blocker for me. IMO, we should unify the
>> encoding schemes of UTF8 strings across the different SDKs and make it a
>> standard coder. Any thoughts?
>> >>>>
>> >>>> Thanks,
>>
>

Re: [DISCUSS] change the encoding scheme of Python StrUtf8Coder

Posted by Chamikara Jayalath <ch...@google.com>.

On Thu, Apr 4, 2019 at 11:29 AM Robert Bradshaw <ro...@google.com> wrote:

> A URN defines the encoding.
>
> There are (unfortunately) *two* encodings defined for a Coder (defined
> by a URN), the nested and the unnested one. IIRC, in both Java and
> Python, the nested one prefixes with a var-int length, and the
> unnested one does not.
>

Could you clarify where we define the exact encoding ? I only see a URN for
UTF-8 [1] while if you look at the implementations Java includes length in
the encoding [1] while Python [1] does not.

[1]
https://github.com/apache/beam/blob/069fc3de95bd96f34c363308ad9ba988ab58502d/model/pipeline/src/main/proto/beam_runner_api.proto#L563
[2]
https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/coders/StringUtf8Coder.java#L50
[3]
https://github.com/apache/beam/blob/master/sdks/python/apache_beam/coders/coders.py#L321



>
> We should define the spec clearly and have cross-language tests.
>

+1

Regarding backwards compatibility, I agree that we should probably not
update existing coder classes. Probably we should just standardize the
correct encoding (may be as a comment near corresponding URN in the
beam_runner_api.proto ?) and create new coder classes as needed.


>
> On Thu, Apr 4, 2019 at 8:13 PM Pablo Estrada <pa...@google.com> wrote:
> >
> > Could this be a backwards-incompatible change that would break pipelines
> from upgrading? If they have data in-flight in between operators, and we
> change the coder, they would break?
> > I know very little about coders, but since nobody has mentioned it, I
> wanted to make sure we have it in mind.
> > -P.
> >
> > On Wed, Apr 3, 2019 at 8:33 PM Kenneth Knowles <ke...@apache.org> wrote:
> >>
> >> Agree that a coder URN defines the encoding. I see that string UTF-8
> was added to the proto enum, but it needs a written spec of the encoding.
> Ideally some test data that different languages can use to drive compliance
> testing.
> >>
> >> Kenn
> >>
> >> On Wed, Apr 3, 2019 at 6:21 PM Robert Burke <ro...@frantil.com> wrote:
> >>>
> >>> String UTF8 was recently added as a "standard coder " URN in the
> protos, but I don't think that developed beyond Java, so adding it to
> Python would be reasonable in my opinion.
> >>>
> >>> The Go SDK handles Strings as "custom coders" presently which for Go
> are always length prefixed (and reported to the Runner as LP+CustomCoder).
> It would be straight forward to add the correct handling for strings, as Go
> natively treats strings as UTF8.
> >>>
> >>>
> >>> On Wed, Apr 3, 2019, 5:03 PM Heejong Lee <he...@google.com> wrote:
> >>>>
> >>>> Hi all,
> >>>>
> >>>> It looks like UTF-8 String Coder in Java and Python SDKs uses
> different encoding schemes. StringUtf8Coder in Java SDK puts the varint
> length of the input string before actual data bytes however StrUtf8Coder in
> Python SDK directly encodes the input string to bytes value. For the last
> few weeks, I've been testing and fixing cross-language IO transforms and
> this discrepancy is a major blocker for me. IMO, we should unify the
> encoding schemes of UTF8 strings across the different SDKs and make it a
> standard coder. Any thoughts?
> >>>>
> >>>> Thanks,
>

Re: [DISCUSS] change the encoding scheme of Python StrUtf8Coder

Posted by Robert Bradshaw <ro...@google.com>.

A URN defines the encoding.

There are (unfortunately) *two* encodings defined for a Coder (defined
by a URN), the nested and the unnested one. IIRC, in both Java and
Python, the nested one prefixes with a var-int length, and the
unnested one does not.

We should define the spec clearly and have cross-language tests.

On Thu, Apr 4, 2019 at 8:13 PM Pablo Estrada <pa...@google.com> wrote:
>
> Could this be a backwards-incompatible change that would break pipelines from upgrading? If they have data in-flight in between operators, and we change the coder, they would break?
> I know very little about coders, but since nobody has mentioned it, I wanted to make sure we have it in mind.
> -P.
>
> On Wed, Apr 3, 2019 at 8:33 PM Kenneth Knowles <ke...@apache.org> wrote:
>>
>> Agree that a coder URN defines the encoding. I see that string UTF-8 was added to the proto enum, but it needs a written spec of the encoding. Ideally some test data that different languages can use to drive compliance testing.
>>
>> Kenn
>>
>> On Wed, Apr 3, 2019 at 6:21 PM Robert Burke <ro...@frantil.com> wrote:
>>>
>>> String UTF8 was recently added as a "standard coder " URN in the protos, but I don't think that developed beyond Java, so adding it to Python would be reasonable in my opinion.
>>>
>>> The Go SDK handles Strings as "custom coders" presently which for Go are always length prefixed (and reported to the Runner as LP+CustomCoder). It would be straight forward to add the correct handling for strings, as Go natively treats strings as UTF8.
>>>
>>>
>>> On Wed, Apr 3, 2019, 5:03 PM Heejong Lee <he...@google.com> wrote:
>>>>
>>>> Hi all,
>>>>
>>>> It looks like UTF-8 String Coder in Java and Python SDKs uses different encoding schemes. StringUtf8Coder in Java SDK puts the varint length of the input string before actual data bytes however StrUtf8Coder in Python SDK directly encodes the input string to bytes value. For the last few weeks, I've been testing and fixing cross-language IO transforms and this discrepancy is a major blocker for me. IMO, we should unify the encoding schemes of UTF8 strings across the different SDKs and make it a standard coder. Any thoughts?
>>>>
>>>> Thanks,

Re: [DISCUSS] change the encoding scheme of Python StrUtf8Coder

Posted by Pablo Estrada <pa...@google.com>.

Could this be a backwards-incompatible change that would break pipelines
from upgrading? If they have data in-flight in between operators, and we
change the coder, they would break?
I know very little about coders, but since nobody has mentioned it, I
wanted to make sure we have it in mind.
-P.

On Wed, Apr 3, 2019 at 8:33 PM Kenneth Knowles <ke...@apache.org> wrote:

> Agree that a coder URN defines the encoding. I see that string UTF-8 was
> added to the proto enum, but it needs a written spec of the encoding.
> Ideally some test data that different languages can use to drive compliance
> testing.
>
> Kenn
>
> On Wed, Apr 3, 2019 at 6:21 PM Robert Burke <ro...@frantil.com> wrote:
>
>> String UTF8 was recently added as a "standard coder " URN in the protos,
>> but I don't think that developed beyond Java, so adding it to Python would
>> be reasonable in my opinion.
>>
>> The Go SDK handles Strings as "custom coders" presently which for Go are
>> always length prefixed (and reported to the Runner as LP+CustomCoder). It
>> would be straight forward to add the correct handling for strings, as Go
>> natively treats strings as UTF8.
>>
>>
>> On Wed, Apr 3, 2019, 5:03 PM Heejong Lee <he...@google.com> wrote:
>>
>>> Hi all,
>>>
>>> It looks like UTF-8 String Coder in Java and Python SDKs uses different
>>> encoding schemes. StringUtf8Coder in Java SDK puts the varint length of the
>>> input string before actual data bytes however StrUtf8Coder in Python SDK
>>> directly encodes the input string to bytes value. For the last few weeks,
>>> I've been testing and fixing cross-language IO transforms and this
>>> discrepancy is a major blocker for me. IMO, we should unify the encoding
>>> schemes of UTF8 strings across the different SDKs and make it a standard
>>> coder. Any thoughts?
>>>
>>> Thanks,
>>>
>>

Re: [DISCUSS] change the encoding scheme of Python StrUtf8Coder

Posted by Kenneth Knowles <ke...@apache.org>.

Agree that a coder URN defines the encoding. I see that string UTF-8 was
added to the proto enum, but it needs a written spec of the encoding.
Ideally some test data that different languages can use to drive compliance
testing.

Kenn

On Wed, Apr 3, 2019 at 6:21 PM Robert Burke <ro...@frantil.com> wrote:

> String UTF8 was recently added as a "standard coder " URN in the protos,
> but I don't think that developed beyond Java, so adding it to Python would
> be reasonable in my opinion.
>
> The Go SDK handles Strings as "custom coders" presently which for Go are
> always length prefixed (and reported to the Runner as LP+CustomCoder). It
> would be straight forward to add the correct handling for strings, as Go
> natively treats strings as UTF8.
>
>
> On Wed, Apr 3, 2019, 5:03 PM Heejong Lee <he...@google.com> wrote:
>
>> Hi all,
>>
>> It looks like UTF-8 String Coder in Java and Python SDKs uses different
>> encoding schemes. StringUtf8Coder in Java SDK puts the varint length of the
>> input string before actual data bytes however StrUtf8Coder in Python SDK
>> directly encodes the input string to bytes value. For the last few weeks,
>> I've been testing and fixing cross-language IO transforms and this
>> discrepancy is a major blocker for me. IMO, we should unify the encoding
>> schemes of UTF8 strings across the different SDKs and make it a standard
>> coder. Any thoughts?
>>
>> Thanks,
>>
>

Re: [DISCUSS] change the encoding scheme of Python StrUtf8Coder

Posted by Robert Burke <ro...@frantil.com>.

String UTF8 was recently added as a "standard coder " URN in the protos,
but I don't think that developed beyond Java, so adding it to Python would
be reasonable in my opinion.

The Go SDK handles Strings as "custom coders" presently which for Go are
always length prefixed (and reported to the Runner as LP+CustomCoder). It
would be straight forward to add the correct handling for strings, as Go
natively treats strings as UTF8.

On Wed, Apr 3, 2019, 5:03 PM Heejong Lee <he...@google.com> wrote:

> Hi all,
>
> It looks like UTF-8 String Coder in Java and Python SDKs uses different
> encoding schemes. StringUtf8Coder in Java SDK puts the varint length of the
> input string before actual data bytes however StrUtf8Coder in Python SDK
> directly encodes the input string to bytes value. For the last few weeks,
> I've been testing and fixing cross-language IO transforms and this
> discrepancy is a major blocker for me. IMO, we should unify the encoding
> schemes of UTF8 strings across the different SDKs and make it a standard
> coder. Any thoughts?
>
> Thanks,
>