Posted to dev@beam.apache.org by Romain Manni-Bucau <rm...@gmail.com> on 2018/02/04 14:44:56 UTC

coder evolutions?

Hi guys,

I submitted a PR on coders to improve 1. the user experience and 2. the
determinism and handling of coders.

1. The user experience one is linked to what I sent a few days ago: handling
of close() on the streams passed to a coder. Long story short, I add a
SkipCloseCoder which can decorate a coder and wraps the stream (input or
output) in a flavor that skips close() calls. This avoids doing it by default
(which was my preference if you read the related thread, but not everybody's),
and it also makes a coder with this issue easy to use, since the coder's of()
simply wraps itself in this delegating coder.
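
To make the idea concrete, here is a minimal sketch of what such a decorating
coder could look like - the names and the tiny Coder interface below are just
illustrative stand-ins for this mail, not the exact code of the PR:

import java.io.FilterInputStream;
import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Simplified stand-in for the coder contract, just to keep this sketch
// self-contained; the real Coder API has more methods.
interface Coder<T> {
  void encode(T value, OutputStream out) throws IOException;
  T decode(InputStream in) throws IOException;
}

// Decorates another coder and hides close() from it, so a coder that closes
// the stream it is handed can no longer close the shared, framework-owned
// stream underneath.
final class SkipCloseCoder<T> implements Coder<T> {
  private final Coder<T> delegate;

  SkipCloseCoder(Coder<T> delegate) {
    this.delegate = delegate;
  }

  @Override
  public void encode(T value, OutputStream out) throws IOException {
    delegate.encode(value, new FilterOutputStream(out) {
      @Override
      public void close() throws IOException {
        flush(); // keep the bytes but never close the underlying stream
      }
    });
  }

  @Override
  public T decode(InputStream in) throws IOException {
    return delegate.decode(new FilterInputStream(in) {
      @Override
      public void close() {
        // no-op: the framework owns the underlying stream
      }
    });
  }
}

With something like this, the of() factory of a coder known to close its
streams can simply return new SkipCloseCoder<>(actualCoder). A real version
would also forward the bulk write(byte[], int, int) for performance, since
FilterOutputStream forwards it byte by byte.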

2. This one is nastier and mainly concerns IterableLikeCoders. These use
this kind of algorithm (keep in mind they work on a list):

writeSize()
for all element e {
    elementCoder.write(e)
}
writeMagicNumber() // this one depends on the size

The decoding is symmetric, so I skip describing it here.

All these writes (and reads) are done on the same stream. Therefore it
assumes you read exactly as many bytes as you wrote... which is a huge
assumption for a coder which should, by contract, be able to treat the
stream as a stream (reading until -1).

The idea of the fix is to change this encoding to this kind of algorithm:

writeSize()
for all element e {
    writeElementByteCount(e)
    elementCoder.write(e)
}
writeMagicNumber() // still optional

This way, on the decode side, you can wrap the stream per element to enforce
the byte-count limit.
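
Concretely, the decode side can hand the element coder a bounded view of the
shared stream, something like this (illustrative only, not the PR code; the
readSize()/readElementByteCount() helpers in the comment are hypothetical):

import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

// A bounded view over the shared stream: after `remaining` bytes it reports
// end-of-stream (-1), so an element coder that reads "until -1" cannot run
// past its own element into the next one.
final class BoundedInputStream extends FilterInputStream {
  private long remaining;

  BoundedInputStream(InputStream in, long byteCount) {
    super(in);
    this.remaining = byteCount;
  }

  @Override
  public int read() throws IOException {
    if (remaining <= 0) {
      return -1;
    }
    int b = super.read();
    if (b >= 0) {
      remaining--;
    }
    return b;
  }

  @Override
  public int read(byte[] buffer, int offset, int length) throws IOException {
    if (remaining <= 0) {
      return -1;
    }
    int n = super.read(buffer, offset, (int) Math.min(length, remaining));
    if (n > 0) {
      remaining -= n;
    }
    return n;
  }

  @Override
  public void close() {
    // never close the shared underlying stream from here
  }
}

// Decode side, roughly:
//   int size = readSize(in);                       // hypothetical helper
//   for (int i = 0; i < size; i++) {
//     long byteCount = readElementByteCount(in);   // hypothetical helper
//     T element = elementCoder.decode(new BoundedInputStream(in, byteCount));
//   }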

Side note: this does enforce a limit due to Java's byte-count limitation, but
if you check the coder code that limit is already there at the higher level,
so it is not a big deal for now.

In terms of implementation, it uses a LengthAwareCoder which delegates the
encoding to another coder and just adds the byte count before the actual
serialization. Not perfect, but it should be more than enough in terms of
support and performance for Beam if you think of real pipelines (we try to
avoid serializations, or they happen at a few well-known points where this
algorithm should be enough... worst case it is not a huge overhead, mainly
just some memory overhead).
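
As a rough sketch of that wrapper - reusing the simplified Coder interface and
the BoundedInputStream from the sketches above, and assuming for illustration
that the byte count is written as a plain int; the actual LengthAwareCoder in
the PR may differ:

import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Encode side of the idea: buffer each element, learn its byte count, then
// write "count + bytes". The buffering is where the extra memory overhead
// mentioned above comes from.
final class LengthAwareCoder<T> implements Coder<T> {
  private final Coder<T> delegate;

  LengthAwareCoder(Coder<T> delegate) {
    this.delegate = delegate;
  }

  @Override
  public void encode(T value, OutputStream out) throws IOException {
    ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    delegate.encode(value, buffer);                   // serialize to memory first
    byte[] bytes = buffer.toByteArray();
    new DataOutputStream(out).writeInt(bytes.length); // illustrative framing: a plain int length
    out.write(bytes);                                 // then the element's own bytes
  }

  @Override
  public T decode(InputStream in) throws IOException {
    int byteCount = new DataInputStream(in).readInt();
    // hand the delegate a bounded view so it can read "until -1" safely
    return delegate.decode(new BoundedInputStream(in, byteCount));
  }
}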


The PR is available at https://github.com/apache/beam/pull/4594. If you check
it you will see I marked it "WIP". The main reason is that it changes the
encoding format for containers (lists, iterables, ...) and therefore breaks
the python/go/... tests and the standard_coders.yml definition. Some help on
that would be very welcome.

Technical side note, if you wonder: UnownedInputStream doesn't even allow
marking the stream, so there is no real way to read the stream as fast as
possible with standard buffering strategies while also supporting this
implicit, automatic IterableCoder wrapping. In other words, if Beam wants to
support any coder, including the ones that don't need to write the size of
their output - most codecs - then we need to change the way it works to
something like this, which does the length handling for the user, who doesn't
know their coder got wrapped.
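
For illustration, this is the kind of coder I mean - it writes no size header
and reads until -1 (hypothetical code, not from the PR):

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Hypothetical codec in the style of "most codecs" above: it never writes the
// size of its output and, to decode, simply reads until end-of-stream (-1).
// On a stream it owns this is fine; implicitly wrapped inside an iterable
// coder that shares one stream across elements, it would swallow the
// following elements too - unless each element is length-prefixed (or the
// stream is wrapped per element) as proposed above.
final class ReadUntilEofCodec {
  void encode(byte[] payload, OutputStream out) throws IOException {
    out.write(payload); // no size header, just the payload
  }

  byte[] decode(InputStream in) throws IOException {
    ByteArrayOutputStream collected = new ByteArrayOutputStream();
    byte[] chunk = new byte[8192];
    int n;
    while ((n = in.read(chunk)) != -1) { // reads to the end of whatever stream it is handed
      collected.write(chunk, 0, n);
    }
    return collected.toByteArray();
  }
}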

Hope it makes sense, if not, don't hesitate to ask questions.

Happy end of week-end.

Romain Manni-Bucau
@rmannibucau <https://twitter.com/rmannibucau> |  Blog
<https://rmannibucau.metawerx.net/> | Old Blog
<http://rmannibucau.wordpress.com> | Github <https://github.com/rmannibucau> |
LinkedIn <https://www.linkedin.com/in/rmannibucau> | Book
<https://www.packtpub.com/application-development/java-ee-8-high-performance>

Re: coder evolutions?

Posted by Lukasz Cwik <lc...@google.com>.
I do agree that being able to upgrade the encoding for coders between
pipelines is important and thanks for creating BEAM-3616.

Mark/reset for a coder can only be supported by either the root coder or
every leaf coder in a coder tree, unless you wrap each layer with a
byte-copying stream. If you don't wrap the stream, it is likely that there
will be an obscure bug and we will read data multiple times. Imagine we have
a List<String> coder where the List coder marks the stream and the String
coder also marks the stream; since both mark the same stream, the List
coder's mark will be lost. Byte copying to support mark is a non-trivial
cost.
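
A tiny, self-contained illustration of the clobbering with plain Java streams
(not Beam code):

import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// A single stream has only one mark position, so when an inner ("String")
// coder calls mark(), the outer ("List") coder's earlier mark is silently
// replaced and its reset() lands in the wrong place.
public class MarkClobberDemo {
  public static void main(String[] args) throws IOException {
    InputStream in =
        new ByteArrayInputStream("abcdef".getBytes(StandardCharsets.UTF_8));

    in.mark(64);  // "outer" coder marks at position 0
    in.read();    // outer coder reads 'a'
    in.mark(64);  // "inner" coder marks at position 1, overwriting the outer mark
    in.read();    // inner coder reads 'b'
    in.reset();   // intended as the *outer* reset back to 'a'...
    System.out.println((char) in.read()); // ...but prints 'b': the outer mark was lost
  }
}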

It is unknown whether length-prefixing every element in a coder structure
has significant performance implications. The Python SDK does this for many
of its types beyond a handful it treats specially. A couple of degenerate
cases with a lot of encoding overhead are:
Iterable<SmallValues>: any small value like a short requires a length prefix
of at least one byte, which adds 1/x to the size of each x-byte element.
Worst case, iterables of single-byte values double in encoding size.
KV<KV<KV<A, B>, C>, D>: nested structures need to length-prefix their inner
contents at each level.
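
A quick back-of-the-envelope check of the Iterable<SmallValues> case, assuming
a one-byte length prefix per element (plain Java, just the arithmetic):

// Relative overhead of adding one framing byte per x-byte element:
// 1-byte elements double, larger elements pay 1/x.
public class PrefixOverheadDemo {
  public static void main(String[] args) {
    long elements = 1_000_000L;
    int[] elementSizes = {1, 2, 8, 100}; // byte, short, long, small record
    for (int x : elementSizes) {
      long plain = elements * x;
      long prefixed = elements * (x + 1); // +1 byte of framing per element
      System.out.printf("element=%3d bytes  overhead=%.1f%%%n",
          x, 100.0 * (prefixed - plain) / plain);
    }
  }
}
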
The 2.0 Java SDK had this concept of nested and outer contexts that would
get passed into the coder to say whether the coder owned only a part of the
stream or owned the remainder of the stream. In the 2.1 or 2.2 release,
this concept was deprecated and the Java SDK moved closer to the Python
SDK. So if you see the context stuff being used in the Java SDK or access
to the stream in the Python SDK, it's likely handling some degenerate case.
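
As a rough sketch of the idea behind those contexts (the boolean below stands
in for the old nested/outer context object; the real, now-deprecated API may
differ in its exact shape):

import java.io.DataOutputStream;
import java.io.IOException;
import java.io.OutputStream;

// The caller told the coder whether it owned the remainder of the stream
// ("outer") or only a slice of it ("nested"), and the coder could pick a
// cheaper encoding in the outer case.
final class ContextStyleBytesCoder {
  void encode(byte[] value, OutputStream out, boolean isWholeStream)
      throws IOException {
    if (isWholeStream) {
      // "Outer" context: we own the rest of the stream, so no framing is
      // needed; the reader simply consumes until end-of-stream.
      out.write(value);
    } else {
      // "Nested" context: something else follows us on the stream, so we must
      // length-prefix our bytes - the degenerate overhead cases above come
      // from always being in this situation.
      new DataOutputStream(out).writeInt(value.length);
      out.write(value);
    }
  }
}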

On Mon, Feb 5, 2018 at 1:09 AM, Romain Manni-Bucau <rm...@gmail.com> wrote:

> Thanks, created https://issues.apache.org/jira/browse/BEAM-3616
>
> 2018-02-04 22:12 GMT+01:00 Jean-Baptiste Onofré <jb...@nanthrax.net>:
>
> Done
>
> Regards
> JB
>
> On 02/04/2018 09:14 PM, Romain Manni-Bucau wrote:
>
> Works for me. So a jira with target version = 3.
>
> Can someone with the karma check we have a 3.0.0 in jira system please?
>
> On 4 Feb 2018 20:46, "Reuven Lax" <relax@google.com> wrote:
>
> Seems fine to me. At some point we might want to do an audit of existing
> Jira issues, because I suspect there are issues that should be targeted to
> 3.0 but are not yet tagged.
>
> On Sun, Feb 4, 2018 at 11:41 AM, Jean-Baptiste Onofré <jb@nanthrax.net> wrote:
>
> I would prefer to use Jira, with "wish"/"ideas", and adding Beam 3.0.0
> version.
>
> WDYT ?
>
> Regards
> JB
>
> On 02/04/2018 07:55 PM, Reuven Lax wrote:
>
> Do we have a good place to track the items for Beam 3.0, or is Jira the
> best place? Romain has a good point - if this gets forgotten when we do
> Beam 3.0, then we're stuck waiting around till Beam 4.0.
>
> Reuven
>
> On Sun, Feb 4, 2018 at 9:27 AM, Jean-Baptiste Onofré <jb@nanthrax.net> wrote:
>
> That's a good point. In the roadmap for Beam 3, I think it makes sense to
> add a point about this.
>
> Regards
> JB
>
> On 02/04/2018 06:18 PM, Eugene Kirpichov wrote:
>
> I think doing a change that would break pipeline update for every single
> user of Flink and Dataflow needs to be postponed until a next major
> version. Pipeline update is a very frequently used feature, especially by
> the largest users. We've had those users get significantly upset even when
> we accidentally broke update compatibility for some special cases of
> individual transforms; breaking it intentionally and project-wide is too
> extreme to be justified by the benefits of the current change.
>
> That said, I think concerns about coder APIs are reasonable, and it is
> unfortunate that we effectively can't make changes to them right now. It
> would be great if in the next major version we were better prepared for
> evolution of coders, e.g. by having coders support a version marker or
> something like that, with an API for detecting the version of data on wire
> and reading or writing data of an old version. Such a change (introducing
> versioning) would also, of course, be incompatible and would need to be
> postponed until a major version - but, at least, subsequent changes
> wouldn't.
>
> ...And as I was typing this email, seems that this is what the thread
> already came to!
>
> On Sun, Feb 4, 2018 at 9:16 AM Romain Manni-Bucau <rmannibucau@gmail.com> wrote:
>
> I like this idea of migration support at coder level. It would require to
> add a metadata in all outputs which would represent the version then
> coders can handle the logic properly depending the version - we can assume
> a coder dev upgrade the version when he breaks the representation I hope
> ;). With this: no runner impact at all :).
>
> 2018-02-04 18:09 GMT+01:00 Reuven Lax <relax@google.com>:
>
> It would already break quite a number of users at this point.
>
> I think what we should be doing is moving forward on the snapshot/update
> proposal. That proposal actually provides a way forward when coders change
> (it proposes a way to map an old snapshot to one using the new coder, so
> changes to coders in the future will be much easier to make. However much
> of the implementation for this will likely be at the runner level, not the
> SDK level.
>
> Reuven
>
> On Sun, Feb 4, 2018 at 9:04 AM, Romain Manni-Bucau <rmannibucau@gmail.com> wrote:
>
> I fully understand that, and this is one of the reason managing to solve
> these issues is very important and ASAP. My conclusion is that we must
> break it now to avoid to do it later when usage will be way more
> developped - I would be very happy to be wrong on that point - so I
> started this PR and this thread. We can postpone it but it would break
> later so for probably more users.
>
> 2018-02-04 17:49 GMT+01:00 Reuven Lax <relax@google.com>:
>
> Unfortunately several runners (at least Flink and Dataflow) support
> in-place update of streaming pipelines as a key feature, and changing
> coder format breaks this. This is a very important feature of both
> runners, and we should endeavor not to break them.
>
> In-place snapshot and update is also a top-level Beam proposal that was
> received positively, though neither of those runners yet implement the
> proposed interface.
>
> Reuven
>
> On Sun, Feb 4, 2018 at 8:44 AM, Romain Manni-Bucau <rmannibucau@gmail.com> wrote:
>
> Sadly yes, and why the PR is actually WIP. As mentionned it modifies it
> and requires some updates in other languages and the standard_coders.yml
> file (I didn't find how this file was generated). Since coders must be
> about volatile data I don't think it is a big deal to change it though.
>
> 2018-02-04 17:34 GMT+01:00 Reuven Lax <relax@google.com>:
>
> One question - does this change the actual byte encoding of elements?
> We've tried hard not to do that so far for reasons of compatibility.
>
> Reuven
>
> --
> Jean-Baptiste Onofré
> jbonofre@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com

Re: coder evolutions?

Posted by Romain Manni-Bucau <rm...@gmail.com>.
Thanks, created https://issues.apache.org/jira/browse/BEAM-3616


Romain Manni-Bucau
@rmannibucau <https://twitter.com/rmannibucau> |  Blog
<https://rmannibucau.metawerx.net/> | Old Blog
<http://rmannibucau.wordpress.com> | Github <https://github.com/rmannibucau> |
LinkedIn <https://www.linkedin.com/in/rmannibucau> | Book
<https://www.packtpub.com/application-development/java-ee-8-high-performance>

2018-02-04 22:12 GMT+01:00 Jean-Baptiste Onofré <jb...@nanthrax.net>:

> Done
>
> Regards
> JB
>
> On 02/04/2018 09:14 PM, Romain Manni-Bucau wrote:
> > Works for me. So a jira with target version = 3.
> >
> > Can someone with the karma check we have a 3.0.0 in jira system please?
> >
> > Le 4 févr. 2018 20:46, "Reuven Lax" <relax@google.com <mailto:
> relax@google.com>>
> > a écrit :
> >
> >     Seems fine to me. At some point we might want to do an audit of
> existing
> >     Jira issues, because I suspect there are issues that should be
> targeted to
> >     3.0 but are not yet tagged.
> >
> >     On Sun, Feb 4, 2018 at 11:41 AM, Jean-Baptiste Onofré <
> jb@nanthrax.net
> >     <ma...@nanthrax.net>> wrote:
> >
> >         I would prefer to use Jira, with "wish"/"ideas", and adding Beam
> 3.0.0
> >         version.
> >
> >         WDYT ?
> >
> >         Regards
> >         JB
> >
> >         On 02/04/2018 07:55 PM, Reuven Lax wrote:
> >         > Do we have a good place to track the items for Beam 3.0, or is
> Jira the best
> >         > place? Romain has a good point - if this gets forgotten when
> we do Beam 3.0,
> >         > then we're stuck waiting around till Beam 4.0.
> >         >
> >         > Reuven
> >         >
> >         > On Sun, Feb 4, 2018 at 9:27 AM, Jean-Baptiste Onofré <
> jb@nanthrax.net <ma...@nanthrax.net>
> >         > <mailto:jb@nanthrax.net <ma...@nanthrax.net>>> wrote:
> >         >
> >         >     That's a good point. In the roadmap for Beam 3, I think it
> makes
> >         sense to add a
> >         >     point about this.
> >         >
> >         >     Regards
> >         >     JB
> >         >
> >         >     On 02/04/2018 06:18 PM, Eugene Kirpichov wrote:
> >         >     > I think doing a change that would break pipeline update
> for
> >         every single user of
> >         >     > Flink and Dataflow needs to be postponed until a next
> major
> >         version. Pipeline
> >         >     > update is a very frequently used feature, especially by
> the
> >         largest users. We've
> >         >     > had those users get significantly upset even when we
> >         accidentally broke update
> >         >     > compatibility for some special cases of individual
> transforms;
> >         breaking it
> >         >     > intentionally and project-wide is too extreme to be
> justified by
> >         the benefits of
> >         >     > the current change.
> >         >     >
> >         >     > That said, I think concerns about coder APIs are
> reasonable, and
> >         it is
> >         >     > unfortunate that we effectively can't make changes to
> them right
> >         now. It would
> >         >     > be great if in the next major version we were better
> prepared
> >         for evolution of
> >         >     > coders, e.g. by having coders support a version marker or
> >         something like that,
> >         >     > with an API for detecting the version of data on wire and
> >         reading or writing
> >         >     > data of an old version. Such a change (introducing
> versioning)
> >         would also, of
> >         >     > course, be incompatible and would need to be postponed
> until a
> >         major version -
> >         >     > but, at least, subsequent changes wouldn't.
> >         >     >
> >         >     > ...And as I was typing this email, seems that this is
> what the
> >         thread already
> >         >     > came to!
> >         >     >
> >         >     > On Sun, Feb 4, 2018 at 9:16 AM Romain Manni-Bucau
> >         <rmannibucau@gmail.com <ma...@gmail.com>
> >         <mailto:rmannibucau@gmail.com <ma...@gmail.com>>
> >         >     > <mailto:rmannibucau@gmail.com <mailto:
> rmannibucau@gmail.com>
> >         <mailto:rmannibucau@gmail.com <ma...@gmail.com>>>>
> wrote:
> >         >     >
> >         >     >     I like this idea of migration support at coder
> level. It would require to
> >         >     >     add a metadata in all outputs which would represent
> the version then coders
> >         >     >     can handle the logic properly depending the version
> - we can assume a coder
> >         >     >     dev upgrade the version when he breaks the
> representation I hope ;).
> >         >     >     With this: no runner impact at all :).
> >         >     >
> >         >     >
> >         >     >     Romain Manni-Bucau
> >         >     >     @rmannibucau <https://twitter.com/rmannibucau <
> https://twitter.com/rmannibucau>
> >         >     <https://twitter.com/rmannibucau <https://twitter.com/
> rmannibucau>>> |  Blog
> >         >     >     <https://rmannibucau.metawerx.net/ <
> https://rmannibucau.metawerx.net/>
> >         >     <https://rmannibucau.metawerx.net/
> >         <https://rmannibucau.metawerx.net/>>> | Old Blog
> >         >     >     <http://rmannibucau.wordpress.com
> >         <http://rmannibucau.wordpress.com> <
> http://rmannibucau.wordpress.com
> >         <http://rmannibucau.wordpress.com>>>
> >         >     | Github
> >         >     >     <https://github.com/rmannibucau
> >         <https://github.com/rmannibucau> <https://github.com/rmannibucau
> >         <https://github.com/rmannibucau>>> |
> >         >     LinkedIn
> >         >     >     <https://www.linkedin.com/in/rmannibucau
> >         <https://www.linkedin.com/in/rmannibucau>
> >         >     <https://www.linkedin.com/in/rmannibucau
> >         <https://www.linkedin.com/in/rmannibucau>>> | Book
> >         >     >
> >         >
> >           <https://www.packtpub.com/application-development/java-
> ee-8-high-performance
> >         <https://www.packtpub.com/application-development/java-
> ee-8-high-performance>
> >         <https://www.packtpub.com/application-development/java-
> ee-8-high-performance
> >         <https://www.packtpub.com/application-development/java-
> ee-8-high-performance>>>
> >         >     >
> >         >     >     2018-02-04 18:09 GMT+01:00 Reuven Lax <
> relax@google.com <ma...@google.com> <mailto:relax@google.com
> >         <ma...@google.com>>
> >         >     >     <mailto:relax@google.com <ma...@google.com>
> >         <mailto:relax@google.com <ma...@google.com>>>>:
> >         >     >
> >         >     >         It would already break quite a number of users
> at this point.
> >         >     >
> >         >     >         I think what we should be doing is moving
> forward on the snapshot/update
> >         >     >         proposal. That proposal actually provides a way
> forward when coders
> >         >     >         change (it proposes a way to map an old snapshot
> to one using the new
> >         >     >         coder, so changes to coders in the future will
> be much easier to make.
> >         >     >         However much of the implementation for this will
> likely be at the runner
> >         >     >         level, not the SDK level.
> >         >     >
> >         >     >         Reuven
> >         >     >
> >         >     >         On Sun, Feb 4, 2018 at 9:04 AM, Romain
> Manni-Bucau
> >         >     >         <rmannibucau@gmail.com <mailto:
> rmannibucau@gmail.com>
> >         <mailto:rmannibucau@gmail.com <ma...@gmail.com>>
> >         >     <mailto:rmannibucau@gmail.com <mailto:
> rmannibucau@gmail.com>
> >         <mailto:rmannibucau@gmail.com <ma...@gmail.com>>>>
> wrote:
> >         >     >
> >         >     >             I fully understand that, and this is one of
> the reason managing to
> >         >     >             solve these issues is very important and
> ASAP. My conclusion is that
> >         >     >             we must break it now to avoid to do it later
> when usage will be way
> >         >     >             more developped - I would be very happy to
> be wrong on that point -
> >         >     >             so I started this PR and this thread. We can
> postpone it but it
> >         >     >             would break later so for probably more users.
> >         >     >
> >         >     >
> >         >     >             Romain Manni-Bucau
> >         >     >             @rmannibucau <https://twitter.com/
> rmannibucau <https://twitter.com/rmannibucau>
> >         >     <https://twitter.com/rmannibucau <https://twitter.com/
> rmannibucau>>> |  Blog
> >         >     >             <https://rmannibucau.metawerx.net/ <
> https://rmannibucau.metawerx.net/>
> >         >     <https://rmannibucau.metawerx.net/
> >         <https://rmannibucau.metawerx.net/>>> | Old Blog
> >         >     >             <http://rmannibucau.wordpress.com <
> http://rmannibucau.wordpress.com>
> >         >     <http://rmannibucau.wordpress.com <
> http://rmannibucau.wordpress.com>>>
> >         | Github
> >         >     >             <https://github.com/rmannibucau <
> https://github.com/rmannibucau>
> >         >     <https://github.com/rmannibucau <https://github.com/
> rmannibucau>>> | LinkedIn
> >         >     >             <https://www.linkedin.com/in/rmannibucau
> >         <https://www.linkedin.com/in/rmannibucau>
> >         >     <https://www.linkedin.com/in/rmannibucau
> >         <https://www.linkedin.com/in/rmannibucau>>> | Book
> >         >     >
> >         >
> >           <https://www.packtpub.com/application-development/java-
> ee-8-high-performance
> >         <https://www.packtpub.com/application-development/java-
> ee-8-high-performance>
> >         <https://www.packtpub.com/application-development/java-
> ee-8-high-performance
> >         <https://www.packtpub.com/application-development/java-
> ee-8-high-performance>>>
> >         >     >
> >         >     >             2018-02-04 17:49 GMT+01:00 Reuven Lax <
> relax@google.com <ma...@google.com> <mailto:relax@google.com
> >         <ma...@google.com>>
> >         >     >             <mailto:relax@google.com <mailto:
> relax@google.com>
> >         <mailto:relax@google.com <ma...@google.com>>>>:
> >         >     >
> >         >     >                 Unfortunately several runners (at least
> Flink and Dataflow)
> >         >     >                 support in-place update of streaming
> pipelines as a key feature,
> >         >     >                 and changing coder format breaks this.
> This is a very important
> >         >     >                 feature of both runners, and we should
> endeavor not to break them.
> >         >     >
> >         >     >                 In-place snapshot and update is also a
> top-level Beam proposal
> >         >     >                 that was received positively, though
> neither of those runners
> >         >     >                 yet implement the proposed interface.
> >         >     >
> >         >     >                 Reuven
> >         >     >
> >         >     >                 On Sun, Feb 4, 2018 at 8:44 AM, Romain
> Manni-Bucau
> >         >     >                 <rmannibucau@gmail.com <mailto:
> rmannibucau@gmail.com>
> >         <mailto:rmannibucau@gmail.com <ma...@gmail.com>>
> >         >     <mailto:rmannibucau@gmail.com <mailto:
> rmannibucau@gmail.com>
> >         <mailto:rmannibucau@gmail.com <ma...@gmail.com>>>>
> wrote:
> >         >     >
> >         >     >                     Sadly yes, and why the PR is
> actually WIP. As mentionned it
> >         >     >                     modifies it and requires some
> updates in other languages and
> >         >     >                     the standard_coders.yml file (I
> didn't find how this file
> >         >     >                     was generated).
> >         >     >                     Since coders must be about volatile
> data I don't think it is
> >         >     >                     a big deal to change it though.
> >         >     >
> >         >     >
> >         >     >                     Romain Manni-Bucau
> >         >     >                     @rmannibucau <https://twitter.com/
> rmannibucau <https://twitter.com/rmannibucau>
> >         >     <https://twitter.com/rmannibucau <https://twitter.com/
> rmannibucau>>> |  Blog
> >         >     >                     <https://rmannibucau.metawerx.net/ <
> https://rmannibucau.metawerx.net/>
> >         >     <https://rmannibucau.metawerx.net/
> >         <https://rmannibucau.metawerx.net/>>> | Old Blog
> >         >     >                     <http://rmannibucau.wordpress.com <
> http://rmannibucau.wordpress.com>
> >         >     <http://rmannibucau.wordpress.com <
> http://rmannibucau.wordpress.com>>>
> >         | Github
> >         >     >                     <https://github.com/rmannibucau <
> https://github.com/rmannibucau>
> >         >     <https://github.com/rmannibucau <https://github.com/
> rmannibucau>>> | LinkedIn
> >         >     >                     <https://www.linkedin.com/in/
> rmannibucau
> >         <https://www.linkedin.com/in/rmannibucau>
> >         >     <https://www.linkedin.com/in/rmannibucau
> >         <https://www.linkedin.com/in/rmannibucau>>> | Book
> >         >     >
> >         >
> >           <https://www.packtpub.com/application-development/java-
> ee-8-high-performance
> >         <https://www.packtpub.com/application-development/java-
> ee-8-high-performance>
> >         <https://www.packtpub.com/application-development/java-
> ee-8-high-performance
> >         <https://www.packtpub.com/application-development/java-
> ee-8-high-performance>>>
> >         >     >
> >         >     >                     2018-02-04 17:34 GMT+01:00 Reuven
> Lax <relax@google.com <ma...@google.com> <mailto:relax@google.com
> >         <ma...@google.com>>
> >         >     >                     <mailto:relax@google.com
> >         <ma...@google.com> <mailto:relax@google.com
> >         <ma...@google.com>>>>:
> >         >     >
> >         >     >                         One question - does this change
> the actual byte encoding
> >         >     >                         of elements? We've tried hard
> not to do that so far for
> >         >     >                         reasons of compatibility.
> >         >     >
> >         >     >                         Reuven
> >         >     >
> >         >     >                         On Sun, Feb 4, 2018 at 6:44 AM,
> Romain Manni-Bucau
> >         >     >                         <rmannibucau@gmail.com <mailto:
> rmannibucau@gmail.com>
> >         >     <mailto:rmannibucau@gmail.com <mailto:
> rmannibucau@gmail.com>>
> >         <mailto:rmannibucau@gmail.com <ma...@gmail.com>
> >         >     <mailto:rmannibucau@gmail.com <mailto:
> rmannibucau@gmail.com>>>>
> >         >     >                         wrote:
> >         >     >
> >         >     >                             Hi guys,
> >         >     >
> >         >     >                             I submitted a PR on coders to
> >         enhance 1. the user
> >         >     >                             experience 2. the
> determinism and
> >         handling of
> >         >     coders.
> >         >     >
> >         >     >                             1. the user experience is
> linked to
> >         what i
> >         >     sent some
> >         >     >                             days ago: close handling of
> the
> >         streams from a
> >         >     coder
> >         >     >                             code. Long story short I add
> a
> >         SkipCloseCoder
> >         >     which
> >         >     >                             can decorate a coder and
> just wraps
> >         the stream
> >         >     >                             (input or output) in flavors
> >         skipping close()
> >         >     calls.
> >         >     >                             This avoids to do it by
> default
> >         (which had my
> >         >     >                             preference if you read the
> related
> >         thread but not
> >         >     >                             the one of everybody) but
> also makes
> >         the usage
> >         >     of a
> >         >     >                             coder with this issue easy
> since the
> >         of() of the
> >         >     >                             coder just wraps itself in
> this
> >         delagating coder.
> >         >     >
> >         >     >                             2. this one is more nasty
> and mainly
> >         concerns
> >         >     >                             IterableLikeCoders. These
> ones use
> >         this kind of
> >         >     >                             algorithm (keep in mind they
> work on
> >         a list):
> >         >     >
> >         >     >                             writeSize()
> >         >     >                             for all element e {
> >         >     >                                 elementCoder.write(e)
> >         >     >                             }
> >         >     >                             writeMagicNumber() // this
> one
> >         depends the size
> >         >     >
> >         >     >                             The decoding is symmetric so
> I
> >         bypass it here.
> >         >     >
> >         >     >                             Indeed all these writes
> (reads) are
> >         done on
> >         >     the same
> >         >     >                             stream. Therefore it assumes
> you
> >         read as much
> >         >     bytes
> >         >     >                             than you write...which is a
> huge
> >         assumption for a
> >         >     >                             coder which should by
> contract
> >         assume it can read
> >         >     >                             the stream...as a stream
> (until -1).
> >         >     >
> >         >     >                             The idea of the fix is to
> change
> >         this encoding to
> >         >     >                             this kind of algorithm:
> >         >     >
> >         >     >                             writeSize()
> >         >     >                             for all element e {
> >         >     >                                 writeElementByteCount(e)
> >         >     >                                 elementCoder.write(e)
> >         >     >                             }
> >         >     >                             writeMagicNumber() // still
> optionally
> >         >     >
> >         >     >                             This way on the decode size
> you can
> >         wrap the
> >         >     stream
> >         >     >                             by element to enforce the
> limitation
> >         of the
> >         >     byte count.
> >         >     >
> >         >     >                             Side note: this indeed
> enforce a
> >         limitation due to
> >         >     >                             java byte limitation but if
> you
> >         check coder
> >         >     code it
> >         >     >                             is already here at the
> higher level
> >         so it is not a
> >         >     >                             big deal for now.
> >         >     >
> >         >     >                             In terms of implementation
> it uses a
> >         >     >                             LengthAwareCoder which
> delegates to
> >         another coder
> >         >     >                             the encoding and just adds
> the byte
> >         count
> >         >     before the
> >         >     >                             actual serialization. Not
> perfect
> >         but should
> >         >     be more
> >         >     >                             than enough in terms of
> support and
> >         perf for
> >         >     beam if
> >         >     >                             you think real pipelines (we
> try to
> >         avoid
> >         >     >                             serializations or it is done
> on some
> >         well known
> >         >     >                             points where this algo
> should be
> >         >     enough...worse case
> >         >     >                             it is not a huge overhead,
> mainly
> >         just some memory
> >         >     >                             overhead).
> >         >     >
> >         >     >
> >         >     >                             The PR is available
> >         >     >
> >          at https://github.com/apache/beam/pull/4594
> >         <https://github.com/apache/beam/pull/4594>
> >         >     <https://github.com/apache/beam/pull/4594
> >         <https://github.com/apache/beam/pull/4594>>. If you
> >         >     >                             check you will see I put it
> "WIP".
> >         The main reason
> >         >     >                             is that it changes the
> encoding
> >         format for
> >         >     >                             containers (lists, iterable,
> ...)
> >         and therefore
> >         >     >                             breaks python/go/... tests
> and the
> >         >     >                             standard_coders.yml
> definition. Some
> >         help on that
> >         >     >                             would be very welcomed.
> >         >     >
> >         >     >                             Technical side note if you
> >         >     >                             wonder: UnownedInputStream
> doesn't
> >         even allow to
> >         >     >                             mark the stream so there is
> no real
> >         fast way
> >         >     to read
> >         >     >                             the stream as fast as
> possible with
> >         standard
> >         >     >                             buffering strategies and to
> support
> >         this automatic
> >         >     >                             IterableCoder wrapping which
> is
> >         implicit. In other
> >         >     >                             words, if beam wants to
> support any
> >         coder,
> >         >     including
> >         >     >                             the ones not requiring to
> write the
> >         size of the
> >         >     >                             output - most of the codecs
> - then
> >         we need to
> >         >     change
> >         >     >                             the way it works to
> something like
> >         that which does
> >         >     >                             it for the user which
> doesn't know
> >         its coder got
> >         >     >                             wrapped.
> >         >     >
> >         >     >                             Hope it makes sense, if not,
> don't
> >         hesitate to ask
> >         >     >                             questions.
> >         >     >
> >         >     >                             Happy end of week-end.
> >         >     >
> >         >     >                             Romain Manni-Bucau
> >         >     >                             @rmannibucau
> >         <https://twitter.com/rmannibucau <https://twitter.com/
> rmannibucau>
> >         >     <https://twitter.com/rmannibucau <https://twitter.com/
> rmannibucau>>> |
> >         >     >                              Blog <
> https://rmannibucau.metawerx.net/ <https://rmannibucau.metawerx.net/>
> >         >     <https://rmannibucau.metawerx.net/
> >         <https://rmannibucau.metawerx.net/>>> | Old Blog
> >         >     >                             <http://rmannibucau.
> wordpress.com <http://rmannibucau.wordpress.com>
> >         >     <http://rmannibucau.wordpress.com <
> http://rmannibucau.wordpress.com>>>
> >         | Github
> >         >     >                             <https://github.com/
> rmannibucau <https://github.com/rmannibucau>
> >         >     <https://github.com/rmannibucau <https://github.com/
> rmannibucau>>> | LinkedIn
> >         >     >                             <
> https://www.linkedin.com/in/rmannibucau
> >         <https://www.linkedin.com/in/rmannibucau>
> >         >     <https://www.linkedin.com/in/rmannibucau
> >         <https://www.linkedin.com/in/rmannibucau>>> | Book
> >         >     >
> >         >
> >           <https://www.packtpub.com/application-development/java-
> ee-8-high-performance
> >         <https://www.packtpub.com/application-development/java-
> ee-8-high-performance>
> >         <https://www.packtpub.com/application-development/java-
> ee-8-high-performance
> >         <https://www.packtpub.com/application-development/java-
> ee-8-high-performance>>>
> >         >     >
> >         >     >
> >         >     >
> >         >     >
> >         >     >
> >         >     >
> >         >     >
> >         >
> >         >     --
> >         >     Jean-Baptiste Onofré
> >         >     jbonofre@apache.org <ma...@apache.org>
> >         <mailto:jbonofre@apache.org <ma...@apache.org>>
> >         >     http://blog.nanthrax.net
> >         >     Talend - http://www.talend.com
> >         >
> >         >
> >
> >         --
> >         Jean-Baptiste Onofré
> >         jbonofre@apache.org <ma...@apache.org>
> >         http://blog.nanthrax.net
> >         Talend - http://www.talend.com
> >
> >
>
> --
> Jean-Baptiste Onofré
> jbonofre@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>
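A rough illustration of the per-element length prefix proposed in the quoted thread (the writeElementByteCount idea): the sketch below is not the code from the PR; the class name LengthPrefixingCoder, the plain 4-byte length and the per-element buffering are assumptions made for illustration only.

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import org.apache.beam.sdk.coders.Coder;
import org.apache.beam.sdk.coders.CustomCoder;

// Illustrative sketch only: wraps a delegate coder and prefixes every element
// with its encoded byte count, so the decoder never reads past the element.
public final class LengthPrefixingCoder<T> extends CustomCoder<T> {
  private final Coder<T> delegate;

  public LengthPrefixingCoder(Coder<T> delegate) {
    this.delegate = delegate;
  }

  @Override
  public void encode(T value, OutputStream out) throws IOException {
    // Buffer the element first to learn its size, then write "count, bytes".
    ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    delegate.encode(value, buffer);
    byte[] bytes = buffer.toByteArray();
    new DataOutputStream(out).writeInt(bytes.length);
    out.write(bytes);
  }

  @Override
  public T decode(InputStream in) throws IOException {
    // Read exactly the advertised number of bytes and hand the delegate a
    // stream bounded to this single element, whatever its reading strategy is.
    DataInputStream dataIn = new DataInputStream(in);
    int length = dataIn.readInt();
    byte[] bytes = new byte[length];
    dataIn.readFully(bytes);
    return delegate.decode(new ByteArrayInputStream(bytes));
  }
}

The per-element buffering trades some memory for the guarantee that the delegate coder can read its stream however it likes, which matches the memory-overhead trade-off mentioned in the thread.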

Re: coder evolutions?

Posted by Jean-Baptiste Onofré <jb...@nanthrax.net>.
Done

Regards
JB

On 02/04/2018 09:14 PM, Romain Manni-Bucau wrote:
> Works for me. So a jira with target version = 3.
> 
> Can someone with the karma check we have a 3.0.0 in jira system please?

-- 
Jean-Baptiste Onofré
jbonofre@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com

Re: coder evolutions?

Posted by Romain Manni-Bucau <rm...@gmail.com>.
Works for me. So a jira with target version = 3.

Can someone with the karma check we have a 3.0.0 in jira system please?

On Feb 4, 2018 at 20:46, "Reuven Lax" <re...@google.com> wrote:

> Seems fine to me. At some point we might want to do an audit of existing
> Jira issues, because I suspect there are issues that should be targeted to
> 3.0 but are not yet tagged.

Re: coder evolutions?

Posted by Reuven Lax <re...@google.com>.
Seems fine to me. At some point we might want to do an audit of existing
Jira issues, because I suspect there are issues that should be targeted to
3.0 but are not yet tagged.

On Sun, Feb 4, 2018 at 11:41 AM, Jean-Baptiste Onofré <jb...@nanthrax.net>
wrote:

> I would prefer to use Jira, with "wish"/"ideas", and adding Beam 3.0.0
> version.
>
> WDYT ?
>
> Regards
> JB

Re: coder evolutions?

Posted by Jean-Baptiste Onofré <jb...@nanthrax.net>.
I would prefer to use Jira, with "wish"/"ideas" issues, and adding a Beam 3.0.0 version.

WDYT?

Regards
JB

On 02/04/2018 07:55 PM, Reuven Lax wrote:
> Do we have a good place to track the items for Beam 3.0, or is Jira the best
> place? Romain has a good point - if this gets forgotten when we do Beam 3.0,
> then we're stuck waiting around till Beam 4.0.
> 
> Reuven
> 
> On Sun, Feb 4, 2018 at 9:27 AM, Jean-Baptiste Onofré <jb@nanthrax.net
> <ma...@nanthrax.net>> wrote:
> 
>     That's a good point. In the roadmap for Beam 3, I think it makes sense to add a
>     point about this.
> 
>     Regards
>     JB
> 
>     On 02/04/2018 06:18 PM, Eugene Kirpichov wrote:
>     > I think doing a change that would break pipeline update for every single user of
>     > Flink and Dataflow needs to be postponed until a next major version. Pipeline
>     > update is a very frequently used feature, especially by the largest users. We've
>     > had those users get significantly upset even when we accidentally broke update
>     > compatibility for some special cases of individual transforms; breaking it
>     > intentionally and project-wide is too extreme to be justified by the benefits of
>     > the current change.
>     >
>     > That said, I think concerns about coder APIs are reasonable, and it is
>     > unfortunate that we effectively can't make changes to them right now. It would
>     > be great if in the next major version we were better prepared for evolution of
>     > coders, e.g. by having coders support a version marker or something like that,
>     > with an API for detecting the version of data on wire and reading or writing
>     > data of an old version. Such a change (introducing versioning) would also, of
>     > course, be incompatible and would need to be postponed until a major version -
>     > but, at least, subsequent changes wouldn't.
>     >
>     > ...And as I was typing this email, seems that this is what the thread already
>     > came to!
>     >
>     > On Sun, Feb 4, 2018 at 9:16 AM Romain Manni-Bucau <rmannibucau@gmail.com <ma...@gmail.com>
>     > <mailto:rmannibucau@gmail.com <ma...@gmail.com>>> wrote:
>     >
>     >     I like this idea of migration support at coder level. It would require to
>     >     add a metadata in all outputs which would represent the version then coders
>     >     can handle the logic properly depending the version - we can assume a coder
>     >     dev upgrade the version when he breaks the representation I hope ;).
>     >     With this: no runner impact at all :).
>     >
>     >
>     >     Romain Manni-Bucau
>     >     @rmannibucau <https://twitter.com/rmannibucau
>     <https://twitter.com/rmannibucau>> |  Blog
>     >     <https://rmannibucau.metawerx.net/
>     <https://rmannibucau.metawerx.net/>> | Old Blog
>     >     <http://rmannibucau.wordpress.com <http://rmannibucau.wordpress.com>>
>     | Github
>     >     <https://github.com/rmannibucau <https://github.com/rmannibucau>> |
>     LinkedIn
>     >     <https://www.linkedin.com/in/rmannibucau
>     <https://www.linkedin.com/in/rmannibucau>> | Book
>     >   
>      <https://www.packtpub.com/application-development/java-ee-8-high-performance <https://www.packtpub.com/application-development/java-ee-8-high-performance>>
>     >
>     >     2018-02-04 18:09 GMT+01:00 Reuven Lax <relax@google.com <ma...@google.com>
>     >     <mailto:relax@google.com <ma...@google.com>>>:
>     >
>     >         It would already break quite a number of users at this point.
>     >
>     >         I think what we should be doing is moving forward on the snapshot/update
>     >         proposal. That proposal actually provides a way forward when coders
>     >         change (it proposes a way to map an old snapshot to one using the new
>     >         coder, so changes to coders in the future will be much easier to make.
>     >         However much of the implementation for this will likely be at the runner
>     >         level, not the SDK level.
>     >
>     >         Reuven
>     >
>     >         On Sun, Feb 4, 2018 at 9:04 AM, Romain Manni-Bucau
>     >         <rmannibucau@gmail.com <ma...@gmail.com>
>     <mailto:rmannibucau@gmail.com <ma...@gmail.com>>> wrote:
>     >
>     >             I fully understand that, and this is one of the reason managing to
>     >             solve these issues is very important and ASAP. My conclusion is that
>     >             we must break it now to avoid to do it later when usage will be way
>     >             more developped - I would be very happy to be wrong on that point -
>     >             so I started this PR and this thread. We can postpone it but it
>     >             would break later so for probably more users.
>     >
>     >
>     >             Romain Manni-Bucau
>     >             @rmannibucau <https://twitter.com/rmannibucau
>     <https://twitter.com/rmannibucau>> |  Blog
>     >             <https://rmannibucau.metawerx.net/
>     <https://rmannibucau.metawerx.net/>> | Old Blog
>     >             <http://rmannibucau.wordpress.com
>     <http://rmannibucau.wordpress.com>> | Github
>     >             <https://github.com/rmannibucau
>     <https://github.com/rmannibucau>> | LinkedIn
>     >             <https://www.linkedin.com/in/rmannibucau
>     <https://www.linkedin.com/in/rmannibucau>> | Book
>     >           
>      <https://www.packtpub.com/application-development/java-ee-8-high-performance <https://www.packtpub.com/application-development/java-ee-8-high-performance>>
>     >
>     >             2018-02-04 17:49 GMT+01:00 Reuven Lax <relax@google.com <ma...@google.com>
>     >             <mailto:relax@google.com <ma...@google.com>>>:
>     >
>     >                 Unfortunately several runners (at least Flink and Dataflow)
>     >                 support in-place update of streaming pipelines as a key feature,
>     >                 and changing coder format breaks this. This is a very important
>     >                 feature of both runners, and we should endeavor not to break them.
>     >
>     >                 In-place snapshot and update is also a top-level Beam proposal
>     >                 that was received positively, though neither of those runners
>     >                 yet implement the proposed interface.
>     >
>     >                 Reuven
>     >
>     >                 On Sun, Feb 4, 2018 at 8:44 AM, Romain Manni-Bucau
>     >                 <rmannibucau@gmail.com <ma...@gmail.com>
>     <mailto:rmannibucau@gmail.com <ma...@gmail.com>>> wrote:
>     >
>     >                     Sadly yes, and why the PR is actually WIP. As mentionned it
>     >                     modifies it and requires some updates in other languages and
>     >                     the standard_coders.yml file (I didn't find how this file
>     >                     was generated).
>     >                     Since coders must be about volatile data I don't think it is
>     >                     a big deal to change it though.
>     >
>     >
>     >                     Romain Manni-Bucau
>     >                     @rmannibucau <https://twitter.com/rmannibucau
>     <https://twitter.com/rmannibucau>> |  Blog
>     >                     <https://rmannibucau.metawerx.net/
>     <https://rmannibucau.metawerx.net/>> | Old Blog
>     >                     <http://rmannibucau.wordpress.com
>     <http://rmannibucau.wordpress.com>> | Github
>     >                     <https://github.com/rmannibucau
>     <https://github.com/rmannibucau>> | LinkedIn
>     >                     <https://www.linkedin.com/in/rmannibucau
>     <https://www.linkedin.com/in/rmannibucau>> | Book
>     >                   
>      <https://www.packtpub.com/application-development/java-ee-8-high-performance <https://www.packtpub.com/application-development/java-ee-8-high-performance>>
>     >
>     >                     2018-02-04 17:34 GMT+01:00 Reuven Lax <relax@google.com <ma...@google.com>
>     >                     <mailto:relax@google.com <ma...@google.com>>>:
>     >
>     >                         One question - does this change the actual byte encoding
>     >                         of elements? We've tried hard not to do that so far for
>     >                         reasons of compatibility.
>     >
>     >                         Reuven
>     >
>     >                         On Sun, Feb 4, 2018 at 6:44 AM, Romain Manni-Bucau
>     >                         <rmannibucau@gmail.com
>     <ma...@gmail.com> <mailto:rmannibucau@gmail.com
>     <ma...@gmail.com>>>
>     >                         wrote:
>     >
>     >                             Hi guys,
>     >
>     >                             I submitted a PR on coders to enhance 1. the user
>     >                             experience 2. the determinism and handling of
>     coders.
>     >
>     >                             1. the user experience is linked to what i
>     sent some
>     >                             days ago: close handling of the streams from a
>     coder
>     >                             code. Long story short I add a SkipCloseCoder
>     which
>     >                             can decorate a coder and just wraps the stream
>     >                             (input or output) in flavors skipping close()
>     calls.
>     >                             This avoids to do it by default (which had my
>     >                             preference if you read the related thread but not
>     >                             the one of everybody) but also makes the usage
>     of a
>     >                             coder with this issue easy since the of() of the
>     >                             coder just wraps itself in this delagating coder.
>     >
>     >                             2. this one is more nasty and mainly concerns
>     >                             IterableLikeCoders. These ones use this kind of
>     >                             algorithm (keep in mind they work on a list):
>     >
>     >                             writeSize()
>     >                             for all element e {
>     >                                 elementCoder.write(e)
>     >                             }
>     >                             writeMagicNumber() // this one depends the size
>     >
>     >                             The decoding is symmetric so I bypass it here.
>     >
>     >                             Indeed all these writes (reads) are done on
>     the same
>     >                             stream. Therefore it assumes you read as much
>     bytes
>     >                             than you write...which is a huge assumption for a
>     >                             coder which should by contract assume it can read
>     >                             the stream...as a stream (until -1).
>     >
>     >                             The idea of the fix is to change this encoding to
>     >                             this kind of algorithm:
>     >
>     >                             writeSize()
>     >                             for all element e {
>     >                                 writeElementByteCount(e)
>     >                                 elementCoder.write(e)
>     >                             }
>     >                             writeMagicNumber() // still optionally
>     >
>     >                             This way on the decode size you can wrap the
>     stream
>     >                             by element to enforce the limitation of the
>     byte count.
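For concreteness, a minimal Java sketch of that element-length-prefixed layout for the list case (the class and method names are illustrative and not taken from the PR; Guava's ByteStreams is assumed for bounding each element's read, and the unknown-size/streamed case and the trailing marker are left out):

import com.google.common.io.ByteStreams;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.ArrayList;
import java.util.List;
import org.apache.beam.sdk.coders.Coder;

final class LengthPrefixedListFormat {
  // writeSize(), then for each element: writeElementByteCount(e) + the element's bytes.
  static <T> void encode(List<T> values, Coder<T> elementCoder, OutputStream out)
      throws IOException {
    DataOutputStream data = new DataOutputStream(out);
    data.writeInt(values.size());
    for (T e : values) {
      ByteArrayOutputStream buffer = new ByteArrayOutputStream();
      elementCoder.encode(e, buffer);   // buffer first to learn the byte count
      data.writeInt(buffer.size());     // writeElementByteCount(e)
      buffer.writeTo(data);             // the element's actual serialization
    }
    data.flush();
  }

  // On decode, each element coder only ever sees a stream bounded to its own bytes,
  // so it can safely read "as a stream, until -1".
  static <T> List<T> decode(Coder<T> elementCoder, InputStream in) throws IOException {
    DataInputStream data = new DataInputStream(in);
    int size = data.readInt();
    List<T> values = new ArrayList<>(size);
    for (int i = 0; i < size; i++) {
      int byteCount = data.readInt();
      InputStream bounded = ByteStreams.limit(data, byteCount);
      values.add(elementCoder.decode(bounded));
      ByteStreams.exhaust(bounded);     // realign even if the coder read fewer bytes than announced
    }
    return values;
  }
}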
>     >
>     >                             Side note: this indeed enforce a limitation due to
>     >                             java byte limitation but if you check coder
>     code it
>     >                             is already here at the higher level so it is not a
>     >                             big deal for now.
>     >
>     >                             In terms of implementation it uses a
>     >                             LengthAwareCoder which delegates to another coder
>     >                             the encoding and just adds the byte count
>     before the
>     >                             actual serialization. Not perfect but should
>     be more
>     >                             than enough in terms of support and perf for
>     beam if
>     >                             you think real pipelines (we try to avoid
>     >                             serializations or it is done on some well known
>     >                             points where this algo should be
>     enough...worse case
>     >                             it is not a huge overhead, mainly just some memory
>     >                             overhead).
>     >
>     >
>     >                             The PR is available
>     >                             at https://github.com/apache/beam/pull/4594
>     <https://github.com/apache/beam/pull/4594>. If you
>     >                             check you will see I put it "WIP". The main reason
>     >                             is that it changes the encoding format for
>     >                             containers (lists, iterable, ...) and therefore
>     >                             breaks python/go/... tests and the
>     >                             standard_coders.yml definition. Some help on that
>     >                             would be very welcomed.
>     >
>     >                             Technical side note if you
>     >                             wonder: UnownedInputStream doesn't even allow to
>     >                             mark the stream so there is no real fast way
>     to read
>     >                             the stream as fast as possible with standard
>     >                             buffering strategies and to support this automatic
>     >                             IterableCoder wrapping which is implicit. In other
>     >                             words, if beam wants to support any coder,
>     including
>     >                             the ones not requiring to write the size of the
>     >                             output - most of the codecs - then we need to
>     change
>     >                             the way it works to something like that which does
>     >                             it for the user which doesn't know its coder got
>     >                             wrapped.
>     >
>     >                             Hope it makes sense, if not, don't hesitate to ask
>     >                             questions.
>     >
>     >                             Happy end of week-end.
>     >
>     >                             Romain Manni-Bucau
>     >                             @rmannibucau <https://twitter.com/rmannibucau
>     <https://twitter.com/rmannibucau>> |
>     >                              Blog <https://rmannibucau.metawerx.net/
>     <https://rmannibucau.metawerx.net/>> | Old Blog
>     >                             <http://rmannibucau.wordpress.com
>     <http://rmannibucau.wordpress.com>> | Github
>     >                             <https://github.com/rmannibucau
>     <https://github.com/rmannibucau>> | LinkedIn
>     >                             <https://www.linkedin.com/in/rmannibucau
>     <https://www.linkedin.com/in/rmannibucau>> | Book
>     >                           
>      <https://www.packtpub.com/application-development/java-ee-8-high-performance <https://www.packtpub.com/application-development/java-ee-8-high-performance>>
>     >
>     >
>     >
>     >
>     >
>     >
>     >
> 
>     --
>     Jean-Baptiste Onofré
>     jbonofre@apache.org <ma...@apache.org>
>     http://blog.nanthrax.net
>     Talend - http://www.talend.com
> 
> 

-- 
Jean-Baptiste Onofré
jbonofre@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com

Re: coder evolutions?

Posted by Reuven Lax <re...@google.com>.
Do we have a good place to track the items for Beam 3.0, or is Jira the
best place? Romain has a good point - if this gets forgotten when we do
Beam 3.0, then we're stuck waiting around till Beam 4.0.

Reuven

On Sun, Feb 4, 2018 at 9:27 AM, Jean-Baptiste Onofré <jb...@nanthrax.net>
wrote:

> That's a good point. In the roadmap for Beam 3, I think it makes sense to
> add a
> point about this.
>
> Regards
> JB
>
> On 02/04/2018 06:18 PM, Eugene Kirpichov wrote:
> > I think doing a change that would break pipeline update for every single
> user of
> > Flink and Dataflow needs to be postponed until a next major version.
> Pipeline
> > update is a very frequently used feature, especially by the largest
> users. We've
> > had those users get significantly upset even when we accidentally broke
> update
> > compatibility for some special cases of individual transforms; breaking
> it
> > intentionally and project-wide is too extreme to be justified by the
> benefits of
> > the current change.
> >
> > That said, I think concerns about coder APIs are reasonable, and it is
> > unfortunate that we effectively can't make changes to them right now. It
> would
> > be great if in the next major version we were better prepared for
> evolution of
> > coders, e.g. by having coders support a version marker or something like
> that,
> > with an API for detecting the version of data on wire and reading or
> writing
> > data of an old version. Such a change (introducing versioning) would
> also, of
> > course, be incompatible and would need to be postponed until a major
> version -
> > but, at least, subsequent changes wouldn't.
> >
> > ...And as I was typing this email, seems that this is what the thread
> already
> > came to!
> >
> > On Sun, Feb 4, 2018 at 9:16 AM Romain Manni-Bucau <rmannibucau@gmail.com
> > <ma...@gmail.com>> wrote:
> >
> >     I like this idea of migration support at coder level. It would
> require to
> >     add a metadata in all outputs which would represent the version then
> coders
> >     can handle the logic properly depending the version - we can assume
> a coder
> >     dev upgrade the version when he breaks the representation I hope ;).
> >     With this: no runner impact at all :).
> >
> >
> >     Romain Manni-Bucau
> >     @rmannibucau <https://twitter.com/rmannibucau> |  Blog
> >     <https://rmannibucau.metawerx.net/> | Old Blog
> >     <http://rmannibucau.wordpress.com> | Github
> >     <https://github.com/rmannibucau> | LinkedIn
> >     <https://www.linkedin.com/in/rmannibucau> | Book
> >     <https://www.packtpub.com/application-development/java-ee-8-high-performance>
> >
> >     2018-02-04 18:09 GMT+01:00 Reuven Lax <relax@google.com
> >     <ma...@google.com>>:
> >
> >         It would already break quite a number of users at this point.
> >
> >         I think what we should be doing is moving forward on the
> snapshot/update
> >         proposal. That proposal actually provides a way forward when
> coders
> >         change (it proposes a way to map an old snapshot to one using
> the new
> >         coder, so changes to coders in the future will be much easier to
> make.
> >         However much of the implementation for this will likely be at
> the runner
> >         level, not the SDK level.
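One way to picture the old-snapshot-to-new-coder mapping described in that proposal is a plain decode/re-encode step, sketched below with hypothetical helper names (the actual mechanism would live in the runner's snapshot machinery, not in user code):

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import org.apache.beam.sdk.coders.Coder;

final class SnapshotMigration {
  // Decode a snapshotted value with the coder that wrote it, then re-encode it
  // with the new coder so the restored pipeline only ever sees new-format bytes.
  static <T> byte[] migrate(byte[] oldBytes, Coder<T> oldCoder, Coder<T> newCoder)
      throws IOException {
    T value = oldCoder.decode(new ByteArrayInputStream(oldBytes));
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    newCoder.encode(value, out);
    return out.toByteArray();
  }
}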
> >
> >         Reuven
> >
> >         On Sun, Feb 4, 2018 at 9:04 AM, Romain Manni-Bucau
> >         <rmannibucau@gmail.com <ma...@gmail.com>> wrote:
> >
> >             I fully understand that, and this is one of the reason
> managing to
> >             solve these issues is very important and ASAP. My conclusion
> is that
> >             we must break it now to avoid to do it later when usage will
> be way
> >             more developped - I would be very happy to be wrong on that
> point -
> >             so I started this PR and this thread. We can postpone it but
> it
> >             would break later so for probably more users.
> >
> >
> >             Romain Manni-Bucau
> >             @rmannibucau <https://twitter.com/rmannibucau> |  Blog
> >             <https://rmannibucau.metawerx.net/> | Old Blog
> >             <http://rmannibucau.wordpress.com> | Github
> >             <https://github.com/rmannibucau> | LinkedIn
> >             <https://www.linkedin.com/in/rmannibucau> | Book
> >             <https://www.packtpub.com/application-development/java-ee-8-high-performance>
> >
> >             2018-02-04 17:49 GMT+01:00 Reuven Lax <relax@google.com
> >             <ma...@google.com>>:
> >
> >                 Unfortunately several runners (at least Flink and
> Dataflow)
> >                 support in-place update of streaming pipelines as a key
> feature,
> >                 and changing coder format breaks this. This is a very
> important
> >                 feature of both runners, and we should endeavor not to
> break them.
> >
> >                 In-place snapshot and update is also a top-level Beam
> proposal
> >                 that was received positively, though neither of those
> runners
> >                 yet implement the proposed interface.
> >
> >                 Reuven
> >
> >                 On Sun, Feb 4, 2018 at 8:44 AM, Romain Manni-Bucau
> >                 <rmannibucau@gmail.com <ma...@gmail.com>>
> wrote:
> >
> >                     Sadly yes, and why the PR is actually WIP. As
> mentionned it
> >                     modifies it and requires some updates in other
> languages and
> >                     the standard_coders.yml file (I didn't find how this
> file
> >                     was generated).
> >                     Since coders must be about volatile data I don't
> think it is
> >                     a big deal to change it though.
> >
> >
> >                     Romain Manni-Bucau
> >                     @rmannibucau <https://twitter.com/rmannibucau> |
>  Blog
> >                     <https://rmannibucau.metawerx.net/> | Old Blog
> >                     <http://rmannibucau.wordpress.com> | Github
> >                     <https://github.com/rmannibucau> | LinkedIn
> >                     <https://www.linkedin.com/in/rmannibucau> | Book
> >                     <https://www.packtpub.com/application-development/java-ee-8-high-performance>
> >
> >                     2018-02-04 17:34 GMT+01:00 Reuven Lax <
> relax@google.com
> >                     <ma...@google.com>>:
> >
> >                         One question - does this change the actual byte
> encoding
> >                         of elements? We've tried hard not to do that so
> far for
> >                         reasons of compatibility.
> >
> >                         Reuven
> >
> >                         On Sun, Feb 4, 2018 at 6:44 AM, Romain
> Manni-Bucau
> >                         <rmannibucau@gmail.com <mailto:
> rmannibucau@gmail.com>>
> >                         wrote:
> >
> >                             Hi guys,
> >
> >                             I submitted a PR on coders to enhance 1. the
> user
> >                             experience 2. the determinism and handling
> of coders.
> >
> >                             1. the user experience is linked to what i
> sent some
> >                             days ago: close handling of the streams from
> a coder
> >                             code. Long story short I add a
> SkipCloseCoder which
> >                             can decorate a coder and just wraps the
> stream
> >                             (input or output) in flavors skipping
> close() calls.
> >                             This avoids to do it by default (which had my
> >                             preference if you read the related thread
> but not
> >                             the one of everybody) but also makes the
> usage of a
> >                             coder with this issue easy since the of() of
> the
> >                             coder just wraps itself in this delagating
> coder.
> >
> >                             2. this one is more nasty and mainly concerns
> >                             IterableLikeCoders. These ones use this kind
> of
> >                             algorithm (keep in mind they work on a list):
> >
> >                             writeSize()
> >                             for all element e {
> >                                 elementCoder.write(e)
> >                             }
> >                             writeMagicNumber() // this one depends the
> size
> >
> >                             The decoding is symmetric so I bypass it
> here.
> >
> >                             Indeed all these writes (reads) are done on
> the same
> >                             stream. Therefore it assumes you read as
> much bytes
> >                             than you write...which is a huge assumption
> for a
> >                             coder which should by contract assume it can
> read
> >                             the stream...as a stream (until -1).
> >
> >                             The idea of the fix is to change this
> encoding to
> >                             this kind of algorithm:
> >
> >                             writeSize()
> >                             for all element e {
> >                                 writeElementByteCount(e)
> >                                 elementCoder.write(e)
> >                             }
> >                             writeMagicNumber() // still optionally
> >
> >                             This way on the decode size you can wrap the
> stream
> >                             by element to enforce the limitation of the
> byte count.
> >
> >                             Side note: this indeed enforce a limitation
> due to
> >                             java byte limitation but if you check coder
> code it
> >                             is already here at the higher level so it is
> not a
> >                             big deal for now.
> >
> >                             In terms of implementation it uses a
> >                             LengthAwareCoder which delegates to another
> coder
> >                             the encoding and just adds the byte count
> before the
> >                             actual serialization. Not perfect but should
> be more
> >                             than enough in terms of support and perf for
> beam if
> >                             you think real pipelines (we try to avoid
> >                             serializations or it is done on some well
> known
> >                             points where this algo should be
> enough...worse case
> >                             it is not a huge overhead, mainly just some
> memory
> >                             overhead).
> >
> >
> >                             The PR is available
> >                             at https://github.com/apache/beam/pull/4594.
> If you
> >                             check you will see I put it "WIP". The main
> reason
> >                             is that it changes the encoding format for
> >                             containers (lists, iterable, ...) and
> therefore
> >                             breaks python/go/... tests and the
> >                             standard_coders.yml definition. Some help on
> that
> >                             would be very welcomed.
> >
> >                             Technical side note if you
> >                             wonder: UnownedInputStream doesn't even
> allow to
> >                             mark the stream so there is no real fast way
> to read
> >                             the stream as fast as possible with standard
> >                             buffering strategies and to support this
> automatic
> >                             IterableCoder wrapping which is implicit. In
> other
> >                             words, if beam wants to support any coder,
> including
> >                             the ones not requiring to write the size of
> the
> >                             output - most of the codecs - then we need
> to change
> >                             the way it works to something like that
> which does
> >                             it for the user which doesn't know its coder
> got
> >                             wrapped.
> >
> >                             Hope it makes sense, if not, don't hesitate
> to ask
> >                             questions.
> >
> >                             Happy end of week-end.
> >
> >                             Romain Manni-Bucau
> >                             @rmannibucau <https://twitter.com/rmannibucau> |
> >                              Blog <https://rmannibucau.metawerx.net/> |
> Old Blog
> >                             <http://rmannibucau.wordpress.com> | Github
> >                             <https://github.com/rmannibucau> | LinkedIn
> >                             <https://www.linkedin.com/in/rmannibucau> |
> Book
> >                             <https://www.packtpub.com/application-development/java-ee-8-high-performance>
> >
> >
> >
> >
> >
> >
> >
>
> --
> Jean-Baptiste Onofré
> jbonofre@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>

Re: coder evolutions?

Posted by Jean-Baptiste Onofré <jb...@nanthrax.net>.
That's a good point. In the roadmap for Beam 3, I think it makes sense to add a
point about this.

Regards
JB

On 02/04/2018 06:18 PM, Eugene Kirpichov wrote:
> I think doing a change that would break pipeline update for every single user of
> Flink and Dataflow needs to be postponed until a next major version. Pipeline
> update is a very frequently used feature, especially by the largest users. We've
> had those users get significantly upset even when we accidentally broke update
> compatibility for some special cases of individual transforms; breaking it
> intentionally and project-wide is too extreme to be justified by the benefits of
> the current change.
> 
> That said, I think concerns about coder APIs are reasonable, and it is
> unfortunate that we effectively can't make changes to them right now. It would
> be great if in the next major version we were better prepared for evolution of
> coders, e.g. by having coders support a version marker or something like that,
> with an API for detecting the version of data on wire and reading or writing
> data of an old version. Such a change (introducing versioning) would also, of
> course, be incompatible and would need to be postponed until a major version -
> but, at least, subsequent changes wouldn't.
> 
> ...And as I was typing this email, seems that this is what the thread already
> came to!
> 
> On Sun, Feb 4, 2018 at 9:16 AM Romain Manni-Bucau <rmannibucau@gmail.com
> <ma...@gmail.com>> wrote:
> 
>     I like this idea of migration support at coder level. It would require to
>     add a metadata in all outputs which would represent the version then coders
>     can handle the logic properly depending the version - we can assume a coder
>     dev upgrade the version when he breaks the representation I hope ;).
>     With this: no runner impact at all :).
> 
> 
>     Romain Manni-Bucau
>     @rmannibucau <https://twitter.com/rmannibucau> |  Blog
>     <https://rmannibucau.metawerx.net/> | Old Blog
>     <http://rmannibucau.wordpress.com> | Github
>     <https://github.com/rmannibucau> | LinkedIn
>     <https://www.linkedin.com/in/rmannibucau> | Book
>     <https://www.packtpub.com/application-development/java-ee-8-high-performance>
> 
>     2018-02-04 18:09 GMT+01:00 Reuven Lax <relax@google.com
>     <ma...@google.com>>:
> 
>         It would already break quite a number of users at this point.
> 
>         I think what we should be doing is moving forward on the snapshot/update
>         proposal. That proposal actually provides a way forward when coders
>         change (it proposes a way to map an old snapshot to one using the new
>         coder, so changes to coders in the future will be much easier to make.
>         However much of the implementation for this will likely be at the runner
>         level, not the SDK level.
> 
>         Reuven
> 
>         On Sun, Feb 4, 2018 at 9:04 AM, Romain Manni-Bucau
>         <rmannibucau@gmail.com <ma...@gmail.com>> wrote:
> 
>             I fully understand that, and this is one of the reason managing to
>             solve these issues is very important and ASAP. My conclusion is that
>             we must break it now to avoid to do it later when usage will be way
>             more developped - I would be very happy to be wrong on that point -
>             so I started this PR and this thread. We can postpone it but it
>             would break later so for probably more users.
> 
> 
>             Romain Manni-Bucau
>             @rmannibucau <https://twitter.com/rmannibucau> |  Blog
>             <https://rmannibucau.metawerx.net/> | Old Blog
>             <http://rmannibucau.wordpress.com> | Github
>             <https://github.com/rmannibucau> | LinkedIn
>             <https://www.linkedin.com/in/rmannibucau> | Book
>             <https://www.packtpub.com/application-development/java-ee-8-high-performance>
> 
>             2018-02-04 17:49 GMT+01:00 Reuven Lax <relax@google.com
>             <ma...@google.com>>:
> 
>                 Unfortunately several runners (at least Flink and Dataflow)
>                 support in-place update of streaming pipelines as a key feature,
>                 and changing coder format breaks this. This is a very important
>                 feature of both runners, and we should endeavor not to break them.
> 
>                 In-place snapshot and update is also a top-level Beam proposal
>                 that was received positively, though neither of those runners
>                 yet implement the proposed interface.
> 
>                 Reuven
> 
>                 On Sun, Feb 4, 2018 at 8:44 AM, Romain Manni-Bucau
>                 <rmannibucau@gmail.com <ma...@gmail.com>> wrote:
> 
>                     Sadly yes, and why the PR is actually WIP. As mentionned it
>                     modifies it and requires some updates in other languages and
>                     the standard_coders.yml file (I didn't find how this file
>                     was generated).
>                     Since coders must be about volatile data I don't think it is
>                     a big deal to change it though.
> 
> 
>                     Romain Manni-Bucau
>                     @rmannibucau <https://twitter.com/rmannibucau> |  Blog
>                     <https://rmannibucau.metawerx.net/> | Old Blog
>                     <http://rmannibucau.wordpress.com> | Github
>                     <https://github.com/rmannibucau> | LinkedIn
>                     <https://www.linkedin.com/in/rmannibucau> | Book
>                     <https://www.packtpub.com/application-development/java-ee-8-high-performance>
> 
>                     2018-02-04 17:34 GMT+01:00 Reuven Lax <relax@google.com
>                     <ma...@google.com>>:
> 
>                         One question - does this change the actual byte encoding
>                         of elements? We've tried hard not to do that so far for
>                         reasons of compatibility.
> 
>                         Reuven
> 
>                         On Sun, Feb 4, 2018 at 6:44 AM, Romain Manni-Bucau
>                         <rmannibucau@gmail.com <ma...@gmail.com>>
>                         wrote:
> 
>                             Hi guys,
> 
>                             I submitted a PR on coders to enhance 1. the user
>                             experience 2. the determinism and handling of coders.
> 
>                             1. the user experience is linked to what i sent some
>                             days ago: close handling of the streams from a coder
>                             code. Long story short I add a SkipCloseCoder which
>                             can decorate a coder and just wraps the stream
>                             (input or output) in flavors skipping close() calls.
>                             This avoids to do it by default (which had my
>                             preference if you read the related thread but not
>                             the one of everybody) but also makes the usage of a
>                             coder with this issue easy since the of() of the
>                             coder just wraps itself in this delagating coder.
> 
>                             2. this one is more nasty and mainly concerns
>                             IterableLikeCoders. These ones use this kind of
>                             algorithm (keep in mind they work on a list):
> 
>                             writeSize()
>                             for all element e {
>                                 elementCoder.write(e)
>                             }
>                             writeMagicNumber() // this one depends the size
> 
>                             The decoding is symmetric so I bypass it here.
> 
>                             Indeed all these writes (reads) are done on the same
>                             stream. Therefore it assumes you read as much bytes
>                             than you write...which is a huge assumption for a
>                             coder which should by contract assume it can read
>                             the stream...as a stream (until -1).
> 
>                             The idea of the fix is to change this encoding to
>                             this kind of algorithm:
> 
>                             writeSize()
>                             for all element e {
>                                 writeElementByteCount(e)
>                                 elementCoder.write(e)
>                             }
>                             writeMagicNumber() // still optionally
> 
>                             This way on the decode size you can wrap the stream
>                             by element to enforce the limitation of the byte count.
> 
>                             Side note: this indeed enforce a limitation due to
>                             java byte limitation but if you check coder code it
>                             is already here at the higher level so it is not a
>                             big deal for now.
> 
>                             In terms of implementation it uses a
>                             LengthAwareCoder which delegates to another coder
>                             the encoding and just adds the byte count before the
>                             actual serialization. Not perfect but should be more
>                             than enough in terms of support and perf for beam if
>                             you think real pipelines (we try to avoid
>                             serializations or it is done on some well known
>                             points where this algo should be enough...worse case
>                             it is not a huge overhead, mainly just some memory
>                             overhead).
> 
> 
>                             The PR is available
>                             at https://github.com/apache/beam/pull/4594. If you
>                             check you will see I put it "WIP". The main reason
>                             is that it changes the encoding format for
>                             containers (lists, iterable, ...) and therefore
>                             breaks python/go/... tests and the
>                             standard_coders.yml definition. Some help on that
>                             would be very welcomed.
> 
>                             Technical side note if you
>                             wonder: UnownedInputStream doesn't even allow to
>                             mark the stream so there is no real fast way to read
>                             the stream as fast as possible with standard
>                             buffering strategies and to support this automatic
>                             IterableCoder wrapping which is implicit. In other
>                             words, if beam wants to support any coder, including
>                             the ones not requiring to write the size of the
>                             output - most of the codecs - then we need to change
>                             the way it works to something like that which does
>                             it for the user which doesn't know its coder got
>                             wrapped.
> 
>                             Hope it makes sense, if not, don't hesitate to ask
>                             questions.
> 
>                             Happy end of week-end.
> 
>                             Romain Manni-Bucau
>                             @rmannibucau <https://twitter.com/rmannibucau> |
>                              Blog <https://rmannibucau.metawerx.net/> | Old Blog
>                             <http://rmannibucau.wordpress.com> | Github
>                             <https://github.com/rmannibucau> | LinkedIn
>                             <https://www.linkedin.com/in/rmannibucau> | Book
>                             <https://www.packtpub.com/application-development/java-ee-8-high-performance>
> 
> 
> 
> 
> 
> 
> 

-- 
Jean-Baptiste Onofré
jbonofre@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com

Re: coder evolutions?

Posted by Romain Manni-Bucau <rm...@gmail.com>.
yep sadly :(

how should we track it properly so we don't forget it for v3? (I don't trust Jira
much, but if we don't have anything better...)

when do we start Beam 3? next week? :)


Romain Manni-Bucau
@rmannibucau <https://twitter.com/rmannibucau> |  Blog
<https://rmannibucau.metawerx.net/> | Old Blog
<http://rmannibucau.wordpress.com> | Github <https://github.com/rmannibucau> |
LinkedIn <https://www.linkedin.com/in/rmannibucau> | Book
<https://www.packtpub.com/application-development/java-ee-8-high-performance>

2018-02-04 18:18 GMT+01:00 Eugene Kirpichov <ki...@google.com>:

> I think doing a change that would break pipeline update for every single
> user of Flink and Dataflow needs to be postponed until a next major
> version. Pipeline update is a very frequently used feature, especially by
> the largest users. We've had those users get significantly upset even when
> we accidentally broke update compatibility for some special cases of
> individual transforms; breaking it intentionally and project-wide is too
> extreme to be justified by the benefits of the current change.
>
> That said, I think concerns about coder APIs are reasonable, and it is
> unfortunate that we effectively can't make changes to them right now. It
> would be great if in the next major version we were better prepared for
> evolution of coders, e.g. by having coders support a version marker or
> something like that, with an API for detecting the version of data on wire
> and reading or writing data of an old version. Such a change (introducing
> versioning) would also, of course, be incompatible and would need to be
> postponed until a major version - but, at least, subsequent changes
> wouldn't.
>
> ...And as I was typing this email, seems that this is what the thread
> already came to!
>
> On Sun, Feb 4, 2018 at 9:16 AM Romain Manni-Bucau <rm...@gmail.com>
> wrote:
>
>> I like this idea of migration support at coder level. It would require to
>> add a metadata in all outputs which would represent the version then coders
>> can handle the logic properly depending the version - we can assume a coder
>> dev upgrade the version when he breaks the representation I hope ;).
>> With this: no runner impact at all :).
>>
>>
>> Romain Manni-Bucau
>> @rmannibucau <https://twitter.com/rmannibucau> |  Blog
>> <https://rmannibucau.metawerx.net/> | Old Blog
>> <http://rmannibucau.wordpress.com> | Github
>> <https://github.com/rmannibucau> | LinkedIn
>> <https://www.linkedin.com/in/rmannibucau> | Book
>> <https://www.packtpub.com/application-development/java-ee-8-high-performance>
>>
>> 2018-02-04 18:09 GMT+01:00 Reuven Lax <re...@google.com>:
>>
>>> It would already break quite a number of users at this point.
>>>
>>> I think what we should be doing is moving forward on the snapshot/update
>>> proposal. That proposal actually provides a way forward when coders change
>>> (it proposes a way to map an old snapshot to one using the new coder, so
>>> changes to coders in the future will be much easier to make. However much
>>> of the implementation for this will likely be at the runner level, not the
>>> SDK level.
>>>
>>> Reuven
>>>
>>> On Sun, Feb 4, 2018 at 9:04 AM, Romain Manni-Bucau <
>>> rmannibucau@gmail.com> wrote:
>>>
>>>> I fully understand that, and this is one of the reason managing to
>>>> solve these issues is very important and ASAP. My conclusion is that we
>>>> must break it now to avoid to do it later when usage will be way more
>>>> developped - I would be very happy to be wrong on that point - so I started
>>>> this PR and this thread. We can postpone it but it would break later so for
>>>> probably more users.
>>>>
>>>>
>>>> Romain Manni-Bucau
>>>> @rmannibucau <https://twitter.com/rmannibucau> |  Blog
>>>> <https://rmannibucau.metawerx.net/> | Old Blog
>>>> <http://rmannibucau.wordpress.com> | Github
>>>> <https://github.com/rmannibucau> | LinkedIn
>>>> <https://www.linkedin.com/in/rmannibucau> | Book
>>>> <https://www.packtpub.com/application-development/java-ee-8-high-performance>
>>>>
>>>> 2018-02-04 17:49 GMT+01:00 Reuven Lax <re...@google.com>:
>>>>
>>>>> Unfortunately several runners (at least Flink and Dataflow) support
>>>>> in-place update of streaming pipelines as a key feature, and changing coder
>>>>> format breaks this. This is a very important feature of both runners, and
>>>>> we should endeavor not to break them.
>>>>>
>>>>> In-place snapshot and update is also a top-level Beam proposal that
>>>>> was received positively, though neither of those runners yet implement the
>>>>> proposed interface.
>>>>>
>>>>> Reuven
>>>>>
>>>>> On Sun, Feb 4, 2018 at 8:44 AM, Romain Manni-Bucau <
>>>>> rmannibucau@gmail.com> wrote:
>>>>>
>>>>>> Sadly yes, and why the PR is actually WIP. As mentionned it modifies
>>>>>> it and requires some updates in other languages and the standard_coders.yml
>>>>>> file (I didn't find how this file was generated).
>>>>>> Since coders must be about volatile data I don't think it is a big
>>>>>> deal to change it though.
>>>>>>
>>>>>>
>>>>>> Romain Manni-Bucau
>>>>>> @rmannibucau <https://twitter.com/rmannibucau> |  Blog
>>>>>> <https://rmannibucau.metawerx.net/> | Old Blog
>>>>>> <http://rmannibucau.wordpress.com> | Github
>>>>>> <https://github.com/rmannibucau> | LinkedIn
>>>>>> <https://www.linkedin.com/in/rmannibucau> | Book
>>>>>> <https://www.packtpub.com/application-development/java-ee-8-high-performance>
>>>>>>
>>>>>> 2018-02-04 17:34 GMT+01:00 Reuven Lax <re...@google.com>:
>>>>>>
>>>>>>> One question - does this change the actual byte encoding of
>>>>>>> elements? We've tried hard not to do that so far for reasons of
>>>>>>> compatibility.
>>>>>>>
>>>>>>> Reuven
>>>>>>>
>>>>>>> On Sun, Feb 4, 2018 at 6:44 AM, Romain Manni-Bucau <
>>>>>>> rmannibucau@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi guys,
>>>>>>>>
>>>>>>>> I submitted a PR on coders to enhance 1. the user experience 2. the
>>>>>>>> determinism and handling of coders.
>>>>>>>>
>>>>>>>> 1. the user experience is linked to what i sent some days ago:
>>>>>>>> close handling of the streams from a coder code. Long story short I add a
>>>>>>>> SkipCloseCoder which can decorate a coder and just wraps the stream (input
>>>>>>>> or output) in flavors skipping close() calls. This avoids to do it by
>>>>>>>> default (which had my preference if you read the related thread but not the
>>>>>>>> one of everybody) but also makes the usage of a coder with this issue easy
>>>>>>>> since the of() of the coder just wraps itself in this delagating coder.
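For illustration only, the close-skipping stream wrapper behind that idea can be as small as the following (a sketch, not the PR's SkipCloseCoder; an input-side twin would extend FilterInputStream the same way):

import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.OutputStream;

// A decorating coder would hand its delegate this wrapper, so a delegate that
// calls close() cannot close the runner-owned stream underneath.
final class NonClosingOutputStream extends FilterOutputStream {
  NonClosingOutputStream(OutputStream delegate) {
    super(delegate);
  }

  @Override
  public void close() throws IOException {
    flush(); // keep the bytes; deliberately do not propagate close()
  }
}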
>>>>>>>>
>>>>>>>> 2. this one is more nasty and mainly concerns IterableLikeCoders.
>>>>>>>> These ones use this kind of algorithm (keep in mind they work on a list):
>>>>>>>>
>>>>>>>> writeSize()
>>>>>>>> for all element e {
>>>>>>>>     elementCoder.write(e)
>>>>>>>> }
>>>>>>>> writeMagicNumber() // this one depends the size
>>>>>>>>
>>>>>>>> The decoding is symmetric so I bypass it here.
>>>>>>>>
>>>>>>>> Indeed all these writes (reads) are done on the same stream.
>>>>>>>> Therefore it assumes you read as much bytes than you write...which is a
>>>>>>>> huge assumption for a coder which should by contract assume it can read the
>>>>>>>> stream...as a stream (until -1).
>>>>>>>>
>>>>>>>> The idea of the fix is to change this encoding to this kind of
>>>>>>>> algorithm:
>>>>>>>>
>>>>>>>> writeSize()
>>>>>>>> for all element e {
>>>>>>>>     writeElementByteCount(e)
>>>>>>>>     elementCoder.write(e)
>>>>>>>> }
>>>>>>>> writeMagicNumber() // still optionally
>>>>>>>>
>>>>>>>> This way on the decode size you can wrap the stream by element to
>>>>>>>> enforce the limitation of the byte count.
>>>>>>>>
>>>>>>>> Side note: this indeed enforce a limitation due to java byte
>>>>>>>> limitation but if you check coder code it is already here at the higher
>>>>>>>> level so it is not a big deal for now.
>>>>>>>>
>>>>>>>> In terms of implementation it uses a LengthAwareCoder which
>>>>>>>> delegates to another coder the encoding and just adds the byte count before
>>>>>>>> the actual serialization. Not perfect but should be more than enough in
>>>>>>>> terms of support and perf for beam if you think real pipelines (we try to
>>>>>>>> avoid serializations or it is done on some well known points where this
>>>>>>>> algo should be enough...worse case it is not a huge overhead, mainly just
>>>>>>>> some memory overhead).
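As a rough picture of that kind of delegating coder (a sketch only, not the PR's LengthAwareCoder; it assumes Beam's Coder#encode(T, OutputStream)/#decode(InputStream) signatures and buffers the delegate's output, which is the memory overhead mentioned above):

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import org.apache.beam.sdk.coders.Coder;
import org.apache.beam.sdk.coders.CustomCoder;

public final class LengthPrefixingCoder<T> extends CustomCoder<T> {
  private final Coder<T> delegate;

  public LengthPrefixingCoder(Coder<T> delegate) {
    this.delegate = delegate;
  }

  @Override
  public void encode(T value, OutputStream out) throws IOException {
    // Buffer the delegate's output so its byte count can be written first.
    ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    delegate.encode(value, buffer);
    byte[] payload = buffer.toByteArray();
    DataOutputStream data = new DataOutputStream(out);
    data.writeInt(payload.length);
    data.write(payload);
  }

  @Override
  public T decode(InputStream in) throws IOException {
    // Read exactly the announced number of bytes, then let the delegate treat
    // them as a complete stream it can read until -1.
    DataInputStream data = new DataInputStream(in);
    byte[] payload = new byte[data.readInt()];
    data.readFully(payload);
    return delegate.decode(new ByteArrayInputStream(payload));
  }
}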
>>>>>>>>
>>>>>>>>
>>>>>>>> The PR is available at https://github.com/apache/beam/pull/4594.
>>>>>>>> If you check you will see I put it "WIP". The main reason is that it
>>>>>>>> changes the encoding format for containers (lists, iterable, ...) and
>>>>>>>> therefore breaks python/go/... tests and the standard_coders.yml
>>>>>>>> definition. Some help on that would be very welcomed.
>>>>>>>>
>>>>>>>> Technical side note if you wonder: UnownedInputStream doesn't even
>>>>>>>> allow to mark the stream so there is no real fast way to read the stream as
>>>>>>>> fast as possible with standard buffering strategies and to support this
>>>>>>>> automatic IterableCoder wrapping which is implicit. In other words, if beam
>>>>>>>> wants to support any coder, including the ones not requiring to write the
>>>>>>>> size of the output - most of the codecs - then we need to change the way it
>>>>>>>> works to something like that which does it for the user which doesn't know
>>>>>>>> its coder got wrapped.
>>>>>>>>
>>>>>>>> Hope it makes sense, if not, don't hesitate to ask questions.
>>>>>>>>
>>>>>>>> Happy end of week-end.
>>>>>>>>
>>>>>>>> Romain Manni-Bucau
>>>>>>>> @rmannibucau <https://twitter.com/rmannibucau> |  Blog
>>>>>>>> <https://rmannibucau.metawerx.net/> | Old Blog
>>>>>>>> <http://rmannibucau.wordpress.com> | Github
>>>>>>>> <https://github.com/rmannibucau> | LinkedIn
>>>>>>>> <https://www.linkedin.com/in/rmannibucau> | Book
>>>>>>>> <https://www.packtpub.com/application-development/java-ee-8-high-performance>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>

Re: coder evolutions?

Posted by Eugene Kirpichov <ki...@google.com>.
I think doing a change that would break pipeline update for every single
user of Flink and Dataflow needs to be postponed until a next major
version. Pipeline update is a very frequently used feature, especially by
the largest users. We've had those users get significantly upset even when
we accidentally broke update compatibility for some special cases of
individual transforms; breaking it intentionally and project-wide is too
extreme to be justified by the benefits of the current change.

That said, I think concerns about coder APIs are reasonable, and it is
unfortunate that we effectively can't make changes to them right now. It
would be great if in the next major version we were better prepared for
evolution of coders, e.g. by having coders support a version marker or
something like that, with an API for detecting the version of data on the wire
and reading or writing data of an old version. Such a change (introducing
versioning) would also, of course, be incompatible and would need to be
postponed until a major version - but, at least, subsequent changes
wouldn't.

...And as I was typing this email, seems that this is what the thread
already came to!
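
A minimal sketch of what such a version marker could look like at the coder level, assuming a one-byte prefix and purely illustrative class names:

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import org.apache.beam.sdk.coders.Coder;
import org.apache.beam.sdk.coders.CoderException;
import org.apache.beam.sdk.coders.CustomCoder;

public final class VersionedCoder<T> extends CustomCoder<T> {
  private static final int CURRENT_VERSION = 2;
  private final Coder<T> v1;
  private final Coder<T> v2;

  public VersionedCoder(Coder<T> v1, Coder<T> v2) {
    this.v1 = v1;
    this.v2 = v2;
  }

  @Override
  public void encode(T value, OutputStream out) throws IOException {
    out.write(CURRENT_VERSION); // version marker goes on the wire first
    v2.encode(value, out);      // always write the newest format
  }

  @Override
  public T decode(InputStream in) throws IOException {
    int version = in.read();    // dispatch on the marker when reading
    switch (version) {
      case 1: return v1.decode(in);
      case 2: return v2.decode(in);
      default: throw new CoderException("Unknown coder version: " + version);
    }
  }
}

Always writing the newest format while still reading older ones would keep update compatibility without forcing a migration of in-flight data.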

On Sun, Feb 4, 2018 at 9:16 AM Romain Manni-Bucau <rm...@gmail.com>
wrote:

> I like this idea of migration support at coder level. It would require to
> add a metadata in all outputs which would represent the version then coders
> can handle the logic properly depending the version - we can assume a coder
> dev upgrade the version when he breaks the representation I hope ;).
> With this: no runner impact at all :).
>
>
> Romain Manni-Bucau
> @rmannibucau <https://twitter.com/rmannibucau> |  Blog
> <https://rmannibucau.metawerx.net/> | Old Blog
> <http://rmannibucau.wordpress.com> | Github
> <https://github.com/rmannibucau> | LinkedIn
> <https://www.linkedin.com/in/rmannibucau> | Book
> <https://www.packtpub.com/application-development/java-ee-8-high-performance>
>
> 2018-02-04 18:09 GMT+01:00 Reuven Lax <re...@google.com>:
>
>> It would already break quite a number of users at this point.
>>
>> I think what we should be doing is moving forward on the snapshot/update
>> proposal. That proposal actually provides a way forward when coders change
>> (it proposes a way to map an old snapshot to one using the new coder, so
>> changes to coders in the future will be much easier to make. However much
>> of the implementation for this will likely be at the runner level, not the
>> SDK level.
>>
>> Reuven
>>
>> On Sun, Feb 4, 2018 at 9:04 AM, Romain Manni-Bucau <rmannibucau@gmail.com
>> > wrote:
>>
>>> I fully understand that, and this is one of the reason managing to solve
>>> these issues is very important and ASAP. My conclusion is that we must
>>> break it now to avoid to do it later when usage will be way more developped
>>> - I would be very happy to be wrong on that point - so I started this PR
>>> and this thread. We can postpone it but it would break later so for
>>> probably more users.
>>>
>>>
>>> Romain Manni-Bucau
>>> @rmannibucau <https://twitter.com/rmannibucau> |  Blog
>>> <https://rmannibucau.metawerx.net/> | Old Blog
>>> <http://rmannibucau.wordpress.com> | Github
>>> <https://github.com/rmannibucau> | LinkedIn
>>> <https://www.linkedin.com/in/rmannibucau> | Book
>>> <https://www.packtpub.com/application-development/java-ee-8-high-performance>
>>>
>>> 2018-02-04 17:49 GMT+01:00 Reuven Lax <re...@google.com>:
>>>
>>>> Unfortunately several runners (at least Flink and Dataflow) support
>>>> in-place update of streaming pipelines as a key feature, and changing coder
>>>> format breaks this. This is a very important feature of both runners, and
>>>> we should endeavor not to break them.
>>>>
>>>> In-place snapshot and update is also a top-level Beam proposal that was
>>>> received positively, though neither of those runners yet implement the
>>>> proposed interface.
>>>>
>>>> Reuven
>>>>
>>>> On Sun, Feb 4, 2018 at 8:44 AM, Romain Manni-Bucau <
>>>> rmannibucau@gmail.com> wrote:
>>>>
>>>>> Sadly yes, and why the PR is actually WIP. As mentionned it modifies
>>>>> it and requires some updates in other languages and the standard_coders.yml
>>>>> file (I didn't find how this file was generated).
>>>>> Since coders must be about volatile data I don't think it is a big
>>>>> deal to change it though.
>>>>>
>>>>>
>>>>> Romain Manni-Bucau
>>>>> @rmannibucau <https://twitter.com/rmannibucau> |  Blog
>>>>> <https://rmannibucau.metawerx.net/> | Old Blog
>>>>> <http://rmannibucau.wordpress.com> | Github
>>>>> <https://github.com/rmannibucau> | LinkedIn
>>>>> <https://www.linkedin.com/in/rmannibucau> | Book
>>>>> <https://www.packtpub.com/application-development/java-ee-8-high-performance>
>>>>>
>>>>> 2018-02-04 17:34 GMT+01:00 Reuven Lax <re...@google.com>:
>>>>>
>>>>>> One question - does this change the actual byte encoding of elements?
>>>>>> We've tried hard not to do that so far for reasons of compatibility.
>>>>>>
>>>>>> Reuven
>>>>>>
>>>>>> On Sun, Feb 4, 2018 at 6:44 AM, Romain Manni-Bucau <
>>>>>> rmannibucau@gmail.com> wrote:
>>>>>>
>>>>>>> Hi guys,
>>>>>>>
>>>>>>> I submitted a PR on coders to enhance 1. the user experience 2. the
>>>>>>> determinism and handling of coders.
>>>>>>>
>>>>>>> 1. the user experience is linked to what i sent some days ago: close
>>>>>>> handling of the streams from a coder code. Long story short I add a
>>>>>>> SkipCloseCoder which can decorate a coder and just wraps the stream (input
>>>>>>> or output) in flavors skipping close() calls. This avoids to do it by
>>>>>>> default (which had my preference if you read the related thread but not the
>>>>>>> one of everybody) but also makes the usage of a coder with this issue easy
>>>>>>> since the of() of the coder just wraps itself in this delagating coder.
>>>>>>>
>>>>>>> 2. this one is more nasty and mainly concerns IterableLikeCoders.
>>>>>>> These ones use this kind of algorithm (keep in mind they work on a list):
>>>>>>>
>>>>>>> writeSize()
>>>>>>> for all element e {
>>>>>>>     elementCoder.write(e)
>>>>>>> }
>>>>>>> writeMagicNumber() // this one depends the size
>>>>>>>
>>>>>>> The decoding is symmetric so I bypass it here.
>>>>>>>
>>>>>>> Indeed all these writes (reads) are done on the same stream.
>>>>>>> Therefore it assumes you read as much bytes than you write...which is a
>>>>>>> huge assumption for a coder which should by contract assume it can read the
>>>>>>> stream...as a stream (until -1).
>>>>>>>
>>>>>>> The idea of the fix is to change this encoding to this kind of
>>>>>>> algorithm:
>>>>>>>
>>>>>>> writeSize()
>>>>>>> for all element e {
>>>>>>>     writeElementByteCount(e)
>>>>>>>     elementCoder.write(e)
>>>>>>> }
>>>>>>> writeMagicNumber() // still optionally
>>>>>>>
>>>>>>> This way on the decode size you can wrap the stream by element to
>>>>>>> enforce the limitation of the byte count.
>>>>>>>
>>>>>>> Side note: this indeed enforce a limitation due to java byte
>>>>>>> limitation but if you check coder code it is already here at the higher
>>>>>>> level so it is not a big deal for now.
>>>>>>>
>>>>>>> In terms of implementation it uses a LengthAwareCoder which
>>>>>>> delegates to another coder the encoding and just adds the byte count before
>>>>>>> the actual serialization. Not perfect but should be more than enough in
>>>>>>> terms of support and perf for beam if you think real pipelines (we try to
>>>>>>> avoid serializations or it is done on some well known points where this
>>>>>>> algo should be enough...worse case it is not a huge overhead, mainly just
>>>>>>> some memory overhead).
>>>>>>>
>>>>>>>
>>>>>>> The PR is available at https://github.com/apache/beam/pull/4594. If
>>>>>>> you check you will see I put it "WIP". The main reason is that it changes
>>>>>>> the encoding format for containers (lists, iterable, ...) and therefore
>>>>>>> breaks python/go/... tests and the standard_coders.yml definition. Some
>>>>>>> help on that would be very welcomed.
>>>>>>>
>>>>>>> Technical side note if you wonder: UnownedInputStream doesn't even
>>>>>>> allow to mark the stream so there is no real fast way to read the stream as
>>>>>>> fast as possible with standard buffering strategies and to support this
>>>>>>> automatic IterableCoder wrapping which is implicit. In other words, if beam
>>>>>>> wants to support any coder, including the ones not requiring to write the
>>>>>>> size of the output - most of the codecs - then we need to change the way it
>>>>>>> works to something like that which does it for the user which doesn't know
>>>>>>> its coder got wrapped.
>>>>>>>
>>>>>>> Hope it makes sense, if not, don't hesitate to ask questions.
>>>>>>>
>>>>>>> Happy end of week-end.
>>>>>>>
>>>>>>> Romain Manni-Bucau
>>>>>>> @rmannibucau <https://twitter.com/rmannibucau> |  Blog
>>>>>>> <https://rmannibucau.metawerx.net/> | Old Blog
>>>>>>> <http://rmannibucau.wordpress.com> | Github
>>>>>>> <https://github.com/rmannibucau> | LinkedIn
>>>>>>> <https://www.linkedin.com/in/rmannibucau> | Book
>>>>>>> <https://www.packtpub.com/application-development/java-ee-8-high-performance>
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: coder evolutions?

Posted by Romain Manni-Bucau <rm...@gmail.com>.
I like this idea of migration support at the coder level. It would require
adding metadata to all outputs to represent the version; coders can then
handle the logic properly depending on the version - we can assume a coder
dev upgrades the version when they break the representation, I hope ;).
With this: no runner impact at all :).


Romain Manni-Bucau
@rmannibucau <https://twitter.com/rmannibucau> |  Blog
<https://rmannibucau.metawerx.net/> | Old Blog
<http://rmannibucau.wordpress.com> | Github <https://github.com/rmannibucau> |
LinkedIn <https://www.linkedin.com/in/rmannibucau> | Book
<https://www.packtpub.com/application-development/java-ee-8-high-performance>

2018-02-04 18:09 GMT+01:00 Reuven Lax <re...@google.com>:

> It would already break quite a number of users at this point.
>
> I think what we should be doing is moving forward on the snapshot/update
> proposal. That proposal actually provides a way forward when coders change
> (it proposes a way to map an old snapshot to one using the new coder, so
> changes to coders in the future will be much easier to make. However much
> of the implementation for this will likely be at the runner level, not the
> SDK level.
>
> Reuven
>
> On Sun, Feb 4, 2018 at 9:04 AM, Romain Manni-Bucau <rm...@gmail.com>
> wrote:
>
>> I fully understand that, and this is one of the reason managing to solve
>> these issues is very important and ASAP. My conclusion is that we must
>> break it now to avoid to do it later when usage will be way more developped
>> - I would be very happy to be wrong on that point - so I started this PR
>> and this thread. We can postpone it but it would break later so for
>> probably more users.
>>
>>
>> Romain Manni-Bucau
>> @rmannibucau <https://twitter.com/rmannibucau> |  Blog
>> <https://rmannibucau.metawerx.net/> | Old Blog
>> <http://rmannibucau.wordpress.com> | Github
>> <https://github.com/rmannibucau> | LinkedIn
>> <https://www.linkedin.com/in/rmannibucau> | Book
>> <https://www.packtpub.com/application-development/java-ee-8-high-performance>
>>
>> 2018-02-04 17:49 GMT+01:00 Reuven Lax <re...@google.com>:
>>
>>> Unfortunately several runners (at least Flink and Dataflow) support
>>> in-place update of streaming pipelines as a key feature, and changing coder
>>> format breaks this. This is a very important feature of both runners, and
>>> we should endeavor not to break them.
>>>
>>> In-place snapshot and update is also a top-level Beam proposal that was
>>> received positively, though neither of those runners yet implement the
>>> proposed interface.
>>>
>>> Reuven
>>>
>>> On Sun, Feb 4, 2018 at 8:44 AM, Romain Manni-Bucau <
>>> rmannibucau@gmail.com> wrote:
>>>
>>>> Sadly yes, and why the PR is actually WIP. As mentionned it modifies it
>>>> and requires some updates in other languages and the standard_coders.yml
>>>> file (I didn't find how this file was generated).
>>>> Since coders must be about volatile data I don't think it is a big deal
>>>> to change it though.
>>>>
>>>>
>>>> Romain Manni-Bucau
>>>> @rmannibucau <https://twitter.com/rmannibucau> |  Blog
>>>> <https://rmannibucau.metawerx.net/> | Old Blog
>>>> <http://rmannibucau.wordpress.com> | Github
>>>> <https://github.com/rmannibucau> | LinkedIn
>>>> <https://www.linkedin.com/in/rmannibucau> | Book
>>>> <https://www.packtpub.com/application-development/java-ee-8-high-performance>
>>>>
>>>> 2018-02-04 17:34 GMT+01:00 Reuven Lax <re...@google.com>:
>>>>
>>>>> One question - does this change the actual byte encoding of elements?
>>>>> We've tried hard not to do that so far for reasons of compatibility.
>>>>>
>>>>> Reuven
>>>>>
>>>>> On Sun, Feb 4, 2018 at 6:44 AM, Romain Manni-Bucau <
>>>>> rmannibucau@gmail.com> wrote:
>>>>>
>>>>>> Hi guys,
>>>>>>
>>>>>> I submitted a PR on coders to enhance 1. the user experience 2. the
>>>>>> determinism and handling of coders.
>>>>>>
>>>>>> 1. the user experience is linked to what i sent some days ago: close
>>>>>> handling of the streams from a coder code. Long story short I add a
>>>>>> SkipCloseCoder which can decorate a coder and just wraps the stream (input
>>>>>> or output) in flavors skipping close() calls. This avoids to do it by
>>>>>> default (which had my preference if you read the related thread but not the
>>>>>> one of everybody) but also makes the usage of a coder with this issue easy
>>>>>> since the of() of the coder just wraps itself in this delagating coder.
>>>>>>
>>>>>> 2. this one is more nasty and mainly concerns IterableLikeCoders.
>>>>>> These ones use this kind of algorithm (keep in mind they work on a list):
>>>>>>
>>>>>> writeSize()
>>>>>> for all element e {
>>>>>>     elementCoder.write(e)
>>>>>> }
>>>>>> writeMagicNumber() // this one depends the size
>>>>>>
>>>>>> The decoding is symmetric so I bypass it here.
>>>>>>
>>>>>> Indeed all these writes (reads) are done on the same stream.
>>>>>> Therefore it assumes you read as much bytes than you write...which is a
>>>>>> huge assumption for a coder which should by contract assume it can read the
>>>>>> stream...as a stream (until -1).
>>>>>>
>>>>>> The idea of the fix is to change this encoding to this kind of
>>>>>> algorithm:
>>>>>>
>>>>>> writeSize()
>>>>>> for all element e {
>>>>>>     writeElementByteCount(e)
>>>>>>     elementCoder.write(e)
>>>>>> }
>>>>>> writeMagicNumber() // still optionally
>>>>>>
>>>>>> This way on the decode size you can wrap the stream by element to
>>>>>> enforce the limitation of the byte count.
>>>>>>
>>>>>> Side note: this indeed enforce a limitation due to java byte
>>>>>> limitation but if you check coder code it is already here at the higher
>>>>>> level so it is not a big deal for now.
>>>>>>
>>>>>> In terms of implementation it uses a LengthAwareCoder which delegates
>>>>>> to another coder the encoding and just adds the byte count before the
>>>>>> actual serialization. Not perfect but should be more than enough in terms
>>>>>> of support and perf for beam if you think real pipelines (we try to avoid
>>>>>> serializations or it is done on some well known points where this algo
>>>>>> should be enough...worse case it is not a huge overhead, mainly just some
>>>>>> memory overhead).
>>>>>>
>>>>>>
>>>>>> The PR is available at https://github.com/apache/beam/pull/4594. If
>>>>>> you check you will see I put it "WIP". The main reason is that it changes
>>>>>> the encoding format for containers (lists, iterable, ...) and therefore
>>>>>> breaks python/go/... tests and the standard_coders.yml definition. Some
>>>>>> help on that would be very welcomed.
>>>>>>
>>>>>> Technical side note if you wonder: UnownedInputStream doesn't even
>>>>>> allow to mark the stream so there is no real fast way to read the stream as
>>>>>> fast as possible with standard buffering strategies and to support this
>>>>>> automatic IterableCoder wrapping which is implicit. In other words, if beam
>>>>>> wants to support any coder, including the ones not requiring to write the
>>>>>> size of the output - most of the codecs - then we need to change the way it
>>>>>> works to something like that which does it for the user which doesn't know
>>>>>> its coder got wrapped.
>>>>>>
>>>>>> Hope it makes sense, if not, don't hesitate to ask questions.
>>>>>>
>>>>>> Happy end of week-end.
>>>>>>
>>>>>> Romain Manni-Bucau
>>>>>> @rmannibucau <https://twitter.com/rmannibucau> |  Blog
>>>>>> <https://rmannibucau.metawerx.net/> | Old Blog
>>>>>> <http://rmannibucau.wordpress.com> | Github
>>>>>> <https://github.com/rmannibucau> | LinkedIn
>>>>>> <https://www.linkedin.com/in/rmannibucau> | Book
>>>>>> <https://www.packtpub.com/application-development/java-ee-8-high-performance>
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>

Re: coder evolutions?

Posted by Reuven Lax <re...@google.com>.
It would already break quite a number of users at this point.

I think what we should be doing is moving forward on the snapshot/update
proposal. That proposal actually provides a way forward when coders change
(it proposes a way to map an old snapshot to one using the new coder), so
changes to coders in the future will be much easier to make. However, much
of the implementation for this will likely be at the runner level, not the
SDK level.
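
Purely as an illustration of what "map an old snapshot to one using the new
coder" could mean in code - the snapshot/update proposal does not define
this interface, and all names below are made up:

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import org.apache.beam.sdk.coders.Coder;

// Hypothetical hook a runner could call while restoring a snapshot: decode
// each stored value with the coder that wrote it, re-encode it with the
// coder the updated pipeline uses, and write it into the new snapshot.
final class SnapshotCoderMigration<T> {
  private final Coder<T> oldCoder;
  private final Coder<T> newCoder;

  SnapshotCoderMigration(Coder<T> oldCoder, Coder<T> newCoder) {
    this.oldCoder = oldCoder;
    this.newCoder = newCoder;
  }

  void migrateOne(InputStream oldSnapshot, OutputStream newSnapshot) throws IOException {
    T value = oldCoder.decode(oldSnapshot);  // read the old representation
    newCoder.encode(value, newSnapshot);     // write it in the new format
  }
}

The real work - deciding when and where to invoke such a hook - is exactly
the part that would live at the runner level.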

Reuven

On Sun, Feb 4, 2018 at 9:04 AM, Romain Manni-Bucau <rm...@gmail.com>
wrote:

> I fully understand that, and this is one of the reason managing to solve
> these issues is very important and ASAP. My conclusion is that we must
> break it now to avoid to do it later when usage will be way more developped
> - I would be very happy to be wrong on that point - so I started this PR
> and this thread. We can postpone it but it would break later so for
> probably more users.
>
>
> Romain Manni-Bucau
> @rmannibucau <https://twitter.com/rmannibucau> |  Blog
> <https://rmannibucau.metawerx.net/> | Old Blog
> <http://rmannibucau.wordpress.com> | Github
> <https://github.com/rmannibucau> | LinkedIn
> <https://www.linkedin.com/in/rmannibucau> | Book
> <https://www.packtpub.com/application-development/java-ee-8-high-performance>
>
> 2018-02-04 17:49 GMT+01:00 Reuven Lax <re...@google.com>:
>
>> Unfortunately several runners (at least Flink and Dataflow) support
>> in-place update of streaming pipelines as a key feature, and changing coder
>> format breaks this. This is a very important feature of both runners, and
>> we should endeavor not to break them.
>>
>> In-place snapshot and update is also a top-level Beam proposal that was
>> received positively, though neither of those runners yet implement the
>> proposed interface.
>>
>> Reuven
>>
>> On Sun, Feb 4, 2018 at 8:44 AM, Romain Manni-Bucau <rmannibucau@gmail.com
>> > wrote:
>>
>>> Sadly yes, and why the PR is actually WIP. As mentionned it modifies it
>>> and requires some updates in other languages and the standard_coders.yml
>>> file (I didn't find how this file was generated).
>>> Since coders must be about volatile data I don't think it is a big deal
>>> to change it though.
>>>
>>>
>>> Romain Manni-Bucau
>>> @rmannibucau <https://twitter.com/rmannibucau> |  Blog
>>> <https://rmannibucau.metawerx.net/> | Old Blog
>>> <http://rmannibucau.wordpress.com> | Github
>>> <https://github.com/rmannibucau> | LinkedIn
>>> <https://www.linkedin.com/in/rmannibucau> | Book
>>> <https://www.packtpub.com/application-development/java-ee-8-high-performance>
>>>
>>> 2018-02-04 17:34 GMT+01:00 Reuven Lax <re...@google.com>:
>>>
>>>> One question - does this change the actual byte encoding of elements?
>>>> We've tried hard not to do that so far for reasons of compatibility.
>>>>
>>>> Reuven
>>>>
>>>> On Sun, Feb 4, 2018 at 6:44 AM, Romain Manni-Bucau <
>>>> rmannibucau@gmail.com> wrote:
>>>>
>>>>> Hi guys,
>>>>>
>>>>> I submitted a PR on coders to enhance 1. the user experience 2. the
>>>>> determinism and handling of coders.
>>>>>
>>>>> 1. the user experience is linked to what i sent some days ago: close
>>>>> handling of the streams from a coder code. Long story short I add a
>>>>> SkipCloseCoder which can decorate a coder and just wraps the stream (input
>>>>> or output) in flavors skipping close() calls. This avoids to do it by
>>>>> default (which had my preference if you read the related thread but not the
>>>>> one of everybody) but also makes the usage of a coder with this issue easy
>>>>> since the of() of the coder just wraps itself in this delagating coder.
>>>>>
>>>>> 2. this one is more nasty and mainly concerns IterableLikeCoders.
>>>>> These ones use this kind of algorithm (keep in mind they work on a list):
>>>>>
>>>>> writeSize()
>>>>> for all element e {
>>>>>     elementCoder.write(e)
>>>>> }
>>>>> writeMagicNumber() // this one depends the size
>>>>>
>>>>> The decoding is symmetric so I bypass it here.
>>>>>
>>>>> Indeed all these writes (reads) are done on the same stream. Therefore
>>>>> it assumes you read as much bytes than you write...which is a huge
>>>>> assumption for a coder which should by contract assume it can read the
>>>>> stream...as a stream (until -1).
>>>>>
>>>>> The idea of the fix is to change this encoding to this kind of
>>>>> algorithm:
>>>>>
>>>>> writeSize()
>>>>> for all element e {
>>>>>     writeElementByteCount(e)
>>>>>     elementCoder.write(e)
>>>>> }
>>>>> writeMagicNumber() // still optionally
>>>>>
>>>>> This way on the decode size you can wrap the stream by element to
>>>>> enforce the limitation of the byte count.
>>>>>
>>>>> Side note: this indeed enforce a limitation due to java byte
>>>>> limitation but if you check coder code it is already here at the higher
>>>>> level so it is not a big deal for now.
>>>>>
>>>>> In terms of implementation it uses a LengthAwareCoder which delegates
>>>>> to another coder the encoding and just adds the byte count before the
>>>>> actual serialization. Not perfect but should be more than enough in terms
>>>>> of support and perf for beam if you think real pipelines (we try to avoid
>>>>> serializations or it is done on some well known points where this algo
>>>>> should be enough...worse case it is not a huge overhead, mainly just some
>>>>> memory overhead).
>>>>>
>>>>>
>>>>> The PR is available at https://github.com/apache/beam/pull/4594. If
>>>>> you check you will see I put it "WIP". The main reason is that it changes
>>>>> the encoding format for containers (lists, iterable, ...) and therefore
>>>>> breaks python/go/... tests and the standard_coders.yml definition. Some
>>>>> help on that would be very welcomed.
>>>>>
>>>>> Technical side note if you wonder: UnownedInputStream doesn't even
>>>>> allow to mark the stream so there is no real fast way to read the stream as
>>>>> fast as possible with standard buffering strategies and to support this
>>>>> automatic IterableCoder wrapping which is implicit. In other words, if beam
>>>>> wants to support any coder, including the ones not requiring to write the
>>>>> size of the output - most of the codecs - then we need to change the way it
>>>>> works to something like that which does it for the user which doesn't know
>>>>> its coder got wrapped.
>>>>>
>>>>> Hope it makes sense, if not, don't hesitate to ask questions.
>>>>>
>>>>> Happy end of week-end.
>>>>>
>>>>> Romain Manni-Bucau
>>>>> @rmannibucau <https://twitter.com/rmannibucau> |  Blog
>>>>> <https://rmannibucau.metawerx.net/> | Old Blog
>>>>> <http://rmannibucau.wordpress.com> | Github
>>>>> <https://github.com/rmannibucau> | LinkedIn
>>>>> <https://www.linkedin.com/in/rmannibucau> | Book
>>>>> <https://www.packtpub.com/application-development/java-ee-8-high-performance>
>>>>>
>>>>
>>>>
>>>
>>
>

Re: coder evolutions?

Posted by Romain Manni-Bucau <rm...@gmail.com>.
I fully understand that, and this is one of the reasons why solving these
issues is so important, and ASAP. My conclusion is that we must break it
now rather than later, when usage will be far more developed - I would be
very happy to be wrong on that point - so I started this PR and this
thread. We can postpone it, but then it would break later, for probably
more users.


Romain Manni-Bucau
@rmannibucau <https://twitter.com/rmannibucau> |  Blog
<https://rmannibucau.metawerx.net/> | Old Blog
<http://rmannibucau.wordpress.com> | Github <https://github.com/rmannibucau> |
LinkedIn <https://www.linkedin.com/in/rmannibucau> | Book
<https://www.packtpub.com/application-development/java-ee-8-high-performance>

2018-02-04 17:49 GMT+01:00 Reuven Lax <re...@google.com>:

> Unfortunately several runners (at least Flink and Dataflow) support
> in-place update of streaming pipelines as a key feature, and changing coder
> format breaks this. This is a very important feature of both runners, and
> we should endeavor not to break them.
>
> In-place snapshot and update is also a top-level Beam proposal that was
> received positively, though neither of those runners yet implement the
> proposed interface.
>
> Reuven
>
> On Sun, Feb 4, 2018 at 8:44 AM, Romain Manni-Bucau <rm...@gmail.com>
> wrote:
>
>> Sadly yes, and why the PR is actually WIP. As mentionned it modifies it
>> and requires some updates in other languages and the standard_coders.yml
>> file (I didn't find how this file was generated).
>> Since coders must be about volatile data I don't think it is a big deal
>> to change it though.
>>
>>
>> Romain Manni-Bucau
>> @rmannibucau <https://twitter.com/rmannibucau> |  Blog
>> <https://rmannibucau.metawerx.net/> | Old Blog
>> <http://rmannibucau.wordpress.com> | Github
>> <https://github.com/rmannibucau> | LinkedIn
>> <https://www.linkedin.com/in/rmannibucau> | Book
>> <https://www.packtpub.com/application-development/java-ee-8-high-performance>
>>
>> 2018-02-04 17:34 GMT+01:00 Reuven Lax <re...@google.com>:
>>
>>> One question - does this change the actual byte encoding of elements?
>>> We've tried hard not to do that so far for reasons of compatibility.
>>>
>>> Reuven
>>>
>>> On Sun, Feb 4, 2018 at 6:44 AM, Romain Manni-Bucau <
>>> rmannibucau@gmail.com> wrote:
>>>
>>>> Hi guys,
>>>>
>>>> I submitted a PR on coders to enhance 1. the user experience 2. the
>>>> determinism and handling of coders.
>>>>
>>>> 1. the user experience is linked to what i sent some days ago: close
>>>> handling of the streams from a coder code. Long story short I add a
>>>> SkipCloseCoder which can decorate a coder and just wraps the stream (input
>>>> or output) in flavors skipping close() calls. This avoids to do it by
>>>> default (which had my preference if you read the related thread but not the
>>>> one of everybody) but also makes the usage of a coder with this issue easy
>>>> since the of() of the coder just wraps itself in this delagating coder.
>>>>
>>>> 2. this one is more nasty and mainly concerns IterableLikeCoders. These
>>>> ones use this kind of algorithm (keep in mind they work on a list):
>>>>
>>>> writeSize()
>>>> for all element e {
>>>>     elementCoder.write(e)
>>>> }
>>>> writeMagicNumber() // this one depends the size
>>>>
>>>> The decoding is symmetric so I bypass it here.
>>>>
>>>> Indeed all these writes (reads) are done on the same stream. Therefore
>>>> it assumes you read as much bytes than you write...which is a huge
>>>> assumption for a coder which should by contract assume it can read the
>>>> stream...as a stream (until -1).
>>>>
>>>> The idea of the fix is to change this encoding to this kind of
>>>> algorithm:
>>>>
>>>> writeSize()
>>>> for all element e {
>>>>     writeElementByteCount(e)
>>>>     elementCoder.write(e)
>>>> }
>>>> writeMagicNumber() // still optionally
>>>>
>>>> This way on the decode size you can wrap the stream by element to
>>>> enforce the limitation of the byte count.
>>>>
>>>> Side note: this indeed enforce a limitation due to java byte limitation
>>>> but if you check coder code it is already here at the higher level so it is
>>>> not a big deal for now.
>>>>
>>>> In terms of implementation it uses a LengthAwareCoder which delegates
>>>> to another coder the encoding and just adds the byte count before the
>>>> actual serialization. Not perfect but should be more than enough in terms
>>>> of support and perf for beam if you think real pipelines (we try to avoid
>>>> serializations or it is done on some well known points where this algo
>>>> should be enough...worse case it is not a huge overhead, mainly just some
>>>> memory overhead).
>>>>
>>>>
>>>> The PR is available at https://github.com/apache/beam/pull/4594. If
>>>> you check you will see I put it "WIP". The main reason is that it changes
>>>> the encoding format for containers (lists, iterable, ...) and therefore
>>>> breaks python/go/... tests and the standard_coders.yml definition. Some
>>>> help on that would be very welcomed.
>>>>
>>>> Technical side note if you wonder: UnownedInputStream doesn't even
>>>> allow to mark the stream so there is no real fast way to read the stream as
>>>> fast as possible with standard buffering strategies and to support this
>>>> automatic IterableCoder wrapping which is implicit. In other words, if beam
>>>> wants to support any coder, including the ones not requiring to write the
>>>> size of the output - most of the codecs - then we need to change the way it
>>>> works to something like that which does it for the user which doesn't know
>>>> its coder got wrapped.
>>>>
>>>> Hope it makes sense, if not, don't hesitate to ask questions.
>>>>
>>>> Happy end of week-end.
>>>>
>>>> Romain Manni-Bucau
>>>> @rmannibucau <https://twitter.com/rmannibucau> |  Blog
>>>> <https://rmannibucau.metawerx.net/> | Old Blog
>>>> <http://rmannibucau.wordpress.com> | Github
>>>> <https://github.com/rmannibucau> | LinkedIn
>>>> <https://www.linkedin.com/in/rmannibucau> | Book
>>>> <https://www.packtpub.com/application-development/java-ee-8-high-performance>
>>>>
>>>
>>>
>>
>

Re: coder evolutions?

Posted by Reuven Lax <re...@google.com>.
Unfortunately several runners (at least Flink and Dataflow) support
in-place update of streaming pipelines as a key feature, and changing coder
format breaks this. This is a very important feature of both runners, and
we should endeavor not to break them.

In-place snapshot and update is also a top-level Beam proposal that was
received positively, though neither of those runners yet implement the
proposed interface.

Reuven

On Sun, Feb 4, 2018 at 8:44 AM, Romain Manni-Bucau <rm...@gmail.com>
wrote:

> Sadly yes, and why the PR is actually WIP. As mentionned it modifies it
> and requires some updates in other languages and the standard_coders.yml
> file (I didn't find how this file was generated).
> Since coders must be about volatile data I don't think it is a big deal to
> change it though.
>
>
> Romain Manni-Bucau
> @rmannibucau <https://twitter.com/rmannibucau> |  Blog
> <https://rmannibucau.metawerx.net/> | Old Blog
> <http://rmannibucau.wordpress.com> | Github
> <https://github.com/rmannibucau> | LinkedIn
> <https://www.linkedin.com/in/rmannibucau> | Book
> <https://www.packtpub.com/application-development/java-ee-8-high-performance>
>
> 2018-02-04 17:34 GMT+01:00 Reuven Lax <re...@google.com>:
>
>> One question - does this change the actual byte encoding of elements?
>> We've tried hard not to do that so far for reasons of compatibility.
>>
>> Reuven
>>
>> On Sun, Feb 4, 2018 at 6:44 AM, Romain Manni-Bucau <rmannibucau@gmail.com
>> > wrote:
>>
>>> Hi guys,
>>>
>>> I submitted a PR on coders to enhance 1. the user experience 2. the
>>> determinism and handling of coders.
>>>
>>> 1. the user experience is linked to what i sent some days ago: close
>>> handling of the streams from a coder code. Long story short I add a
>>> SkipCloseCoder which can decorate a coder and just wraps the stream (input
>>> or output) in flavors skipping close() calls. This avoids to do it by
>>> default (which had my preference if you read the related thread but not the
>>> one of everybody) but also makes the usage of a coder with this issue easy
>>> since the of() of the coder just wraps itself in this delagating coder.
>>>
>>> 2. this one is more nasty and mainly concerns IterableLikeCoders. These
>>> ones use this kind of algorithm (keep in mind they work on a list):
>>>
>>> writeSize()
>>> for all element e {
>>>     elementCoder.write(e)
>>> }
>>> writeMagicNumber() // this one depends the size
>>>
>>> The decoding is symmetric so I bypass it here.
>>>
>>> Indeed all these writes (reads) are done on the same stream. Therefore
>>> it assumes you read as much bytes than you write...which is a huge
>>> assumption for a coder which should by contract assume it can read the
>>> stream...as a stream (until -1).
>>>
>>> The idea of the fix is to change this encoding to this kind of algorithm:
>>>
>>> writeSize()
>>> for all element e {
>>>     writeElementByteCount(e)
>>>     elementCoder.write(e)
>>> }
>>> writeMagicNumber() // still optionally
>>>
>>> This way on the decode size you can wrap the stream by element to
>>> enforce the limitation of the byte count.
>>>
>>> Side note: this indeed enforce a limitation due to java byte limitation
>>> but if you check coder code it is already here at the higher level so it is
>>> not a big deal for now.
>>>
>>> In terms of implementation it uses a LengthAwareCoder which delegates to
>>> another coder the encoding and just adds the byte count before the actual
>>> serialization. Not perfect but should be more than enough in terms of
>>> support and perf for beam if you think real pipelines (we try to avoid
>>> serializations or it is done on some well known points where this algo
>>> should be enough...worse case it is not a huge overhead, mainly just some
>>> memory overhead).
>>>
>>>
>>> The PR is available at https://github.com/apache/beam/pull/4594. If you
>>> check you will see I put it "WIP". The main reason is that it changes the
>>> encoding format for containers (lists, iterable, ...) and therefore breaks
>>> python/go/... tests and the standard_coders.yml definition. Some help on
>>> that would be very welcomed.
>>>
>>> Technical side note if you wonder: UnownedInputStream doesn't even
>>> allow to mark the stream so there is no real fast way to read the stream as
>>> fast as possible with standard buffering strategies and to support this
>>> automatic IterableCoder wrapping which is implicit. In other words, if beam
>>> wants to support any coder, including the ones not requiring to write the
>>> size of the output - most of the codecs - then we need to change the way it
>>> works to something like that which does it for the user which doesn't know
>>> its coder got wrapped.
>>>
>>> Hope it makes sense, if not, don't hesitate to ask questions.
>>>
>>> Happy end of week-end.
>>>
>>> Romain Manni-Bucau
>>> @rmannibucau <https://twitter.com/rmannibucau> |  Blog
>>> <https://rmannibucau.metawerx.net/> | Old Blog
>>> <http://rmannibucau.wordpress.com> | Github
>>> <https://github.com/rmannibucau> | LinkedIn
>>> <https://www.linkedin.com/in/rmannibucau> | Book
>>> <https://www.packtpub.com/application-development/java-ee-8-high-performance>
>>>
>>
>>
>

Re: coder evolutions?

Posted by Romain Manni-Bucau <rm...@gmail.com>.
Sadly yes, and that is why the PR is still WIP. As mentioned, it modifies it
and requires some updates in other languages and in the standard_coders.yml
file (I didn't find how this file is generated).
Since coders should only deal with volatile data, I don't think it is a big
deal to change it though.


Romain Manni-Bucau
@rmannibucau <https://twitter.com/rmannibucau> |  Blog
<https://rmannibucau.metawerx.net/> | Old Blog
<http://rmannibucau.wordpress.com> | Github <https://github.com/rmannibucau> |
LinkedIn <https://www.linkedin.com/in/rmannibucau> | Book
<https://www.packtpub.com/application-development/java-ee-8-high-performance>

2018-02-04 17:34 GMT+01:00 Reuven Lax <re...@google.com>:

> One question - does this change the actual byte encoding of elements?
> We've tried hard not to do that so far for reasons of compatibility.
>
> Reuven
>
> On Sun, Feb 4, 2018 at 6:44 AM, Romain Manni-Bucau <rm...@gmail.com>
> wrote:
>
>> Hi guys,
>>
>> I submitted a PR on coders to enhance 1. the user experience 2. the
>> determinism and handling of coders.
>>
>> 1. the user experience is linked to what i sent some days ago: close
>> handling of the streams from a coder code. Long story short I add a
>> SkipCloseCoder which can decorate a coder and just wraps the stream (input
>> or output) in flavors skipping close() calls. This avoids to do it by
>> default (which had my preference if you read the related thread but not the
>> one of everybody) but also makes the usage of a coder with this issue easy
>> since the of() of the coder just wraps itself in this delagating coder.
>>
>> 2. this one is more nasty and mainly concerns IterableLikeCoders. These
>> ones use this kind of algorithm (keep in mind they work on a list):
>>
>> writeSize()
>> for all element e {
>>     elementCoder.write(e)
>> }
>> writeMagicNumber() // this one depends the size
>>
>> The decoding is symmetric so I bypass it here.
>>
>> Indeed all these writes (reads) are done on the same stream. Therefore it
>> assumes you read as much bytes than you write...which is a huge assumption
>> for a coder which should by contract assume it can read the stream...as a
>> stream (until -1).
>>
>> The idea of the fix is to change this encoding to this kind of algorithm:
>>
>> writeSize()
>> for all element e {
>>     writeElementByteCount(e)
>>     elementCoder.write(e)
>> }
>> writeMagicNumber() // still optionally
>>
>> This way on the decode size you can wrap the stream by element to enforce
>> the limitation of the byte count.
>>
>> Side note: this indeed enforce a limitation due to java byte limitation
>> but if you check coder code it is already here at the higher level so it is
>> not a big deal for now.
>>
>> In terms of implementation it uses a LengthAwareCoder which delegates to
>> another coder the encoding and just adds the byte count before the actual
>> serialization. Not perfect but should be more than enough in terms of
>> support and perf for beam if you think real pipelines (we try to avoid
>> serializations or it is done on some well known points where this algo
>> should be enough...worse case it is not a huge overhead, mainly just some
>> memory overhead).
>>
>>
>> The PR is available at https://github.com/apache/beam/pull/4594. If you
>> check you will see I put it "WIP". The main reason is that it changes the
>> encoding format for containers (lists, iterable, ...) and therefore breaks
>> python/go/... tests and the standard_coders.yml definition. Some help on
>> that would be very welcomed.
>>
>> Technical side note if you wonder: UnownedInputStream doesn't even allow
>> to mark the stream so there is no real fast way to read the stream as fast
>> as possible with standard buffering strategies and to support this
>> automatic IterableCoder wrapping which is implicit. In other words, if beam
>> wants to support any coder, including the ones not requiring to write the
>> size of the output - most of the codecs - then we need to change the way it
>> works to something like that which does it for the user which doesn't know
>> its coder got wrapped.
>>
>> Hope it makes sense, if not, don't hesitate to ask questions.
>>
>> Happy end of week-end.
>>
>> Romain Manni-Bucau
>> @rmannibucau <https://twitter.com/rmannibucau> |  Blog
>> <https://rmannibucau.metawerx.net/> | Old Blog
>> <http://rmannibucau.wordpress.com> | Github
>> <https://github.com/rmannibucau> | LinkedIn
>> <https://www.linkedin.com/in/rmannibucau> | Book
>> <https://www.packtpub.com/application-development/java-ee-8-high-performance>
>>
>
>

Re: coder evolutions?

Posted by Reuven Lax <re...@google.com>.
One question - does this change the actual byte encoding of elements? We've
tried hard not to do that so far for reasons of compatibility.

Reuven

On Sun, Feb 4, 2018 at 6:44 AM, Romain Manni-Bucau <rm...@gmail.com>
wrote:

> Hi guys,
>
> I submitted a PR on coders to enhance 1. the user experience 2. the
> determinism and handling of coders.
>
> 1. the user experience is linked to what i sent some days ago: close
> handling of the streams from a coder code. Long story short I add a
> SkipCloseCoder which can decorate a coder and just wraps the stream (input
> or output) in flavors skipping close() calls. This avoids to do it by
> default (which had my preference if you read the related thread but not the
> one of everybody) but also makes the usage of a coder with this issue easy
> since the of() of the coder just wraps itself in this delagating coder.
>
> 2. this one is more nasty and mainly concerns IterableLikeCoders. These
> ones use this kind of algorithm (keep in mind they work on a list):
>
> writeSize()
> for all element e {
>     elementCoder.write(e)
> }
> writeMagicNumber() // this one depends the size
>
> The decoding is symmetric so I bypass it here.
>
> Indeed all these writes (reads) are done on the same stream. Therefore it
> assumes you read as much bytes than you write...which is a huge assumption
> for a coder which should by contract assume it can read the stream...as a
> stream (until -1).
>
> The idea of the fix is to change this encoding to this kind of algorithm:
>
> writeSize()
> for all element e {
>     writeElementByteCount(e)
>     elementCoder.write(e)
> }
> writeMagicNumber() // still optionally
>
> This way on the decode size you can wrap the stream by element to enforce
> the limitation of the byte count.
>
> Side note: this indeed enforce a limitation due to java byte limitation
> but if you check coder code it is already here at the higher level so it is
> not a big deal for now.
>
> In terms of implementation it uses a LengthAwareCoder which delegates to
> another coder the encoding and just adds the byte count before the actual
> serialization. Not perfect but should be more than enough in terms of
> support and perf for beam if you think real pipelines (we try to avoid
> serializations or it is done on some well known points where this algo
> should be enough...worse case it is not a huge overhead, mainly just some
> memory overhead).
>
>
> The PR is available at https://github.com/apache/beam/pull/4594. If you
> check you will see I put it "WIP". The main reason is that it changes the
> encoding format for containers (lists, iterable, ...) and therefore breaks
> python/go/... tests and the standard_coders.yml definition. Some help on
> that would be very welcomed.
>
> Technical side note if you wonder: UnownedInputStream doesn't even allow
> to mark the stream so there is no real fast way to read the stream as fast
> as possible with standard buffering strategies and to support this
> automatic IterableCoder wrapping which is implicit. In other words, if beam
> wants to support any coder, including the ones not requiring to write the
> size of the output - most of the codecs - then we need to change the way it
> works to something like that which does it for the user which doesn't know
> its coder got wrapped.
>
> Hope it makes sense, if not, don't hesitate to ask questions.
>
> Happy end of week-end.
>
> Romain Manni-Bucau
> @rmannibucau <https://twitter.com/rmannibucau> |  Blog
> <https://rmannibucau.metawerx.net/> | Old Blog
> <http://rmannibucau.wordpress.com> | Github
> <https://github.com/rmannibucau> | LinkedIn
> <https://www.linkedin.com/in/rmannibucau> | Book
> <https://www.packtpub.com/application-development/java-ee-8-high-performance>
>

Re: coder evolutions?

Posted by Romain Manni-Bucau <rm...@gmail.com>.
Does it mean we would change the implicit coder resolution? Do you see it
being backward compatible? If so, it sounds like a good solution.

Le 5 févr. 2018 20:36, "Kenneth Knowles" <kl...@google.com> a écrit :

> TL;DR: create _new_ coders is not a problem. If you have a new idea for an
> encoding, you can build it alongside and users can use it. We also need
> data migration, and this is probably the easy way to be ready for that.
>
> We made a pretty big mistake in our naming of ListCoder, SetCoder, and
> IterableLikeCoder because they make users think it is the
> only/best/canonical encoding. We did it right with e.g. VarLongCoder and
> BigEndianLongCoder. There is a default, but it is just a default.
>
> We actually already need "SetIterableLikeCoder" (aka SetCoder) and perhaps
> "LexicallySortedBytesSetCoder" so we can change coder inference to ask for
> a deterministic coder when it is needed instead of first asking for "any"
> coder and then crashing when we get the wrong type.
>
> Kenn
>
> On Mon, Feb 5, 2018 at 11:00 AM, Robert Bradshaw <ro...@google.com>
> wrote:
>
>> Just to clarify, the issue is that for some types (byte array being
>> the simplest) one needs to know the length of the data in order to
>> decode it from the stream. In particular, the claim is that many
>> libraries out there that do encoding/decoding assume they can gather
>> this information from the end of the stream and so don't explicitly
>> record it. For nested values, someone needs to record these lengths.
>> Note that in the Fn API, nearly everything is nested, as the elements
>> are sent as a large byte stream of concatenated encoded elements.
>>
>> Your proposed solution is to require all container coders (though I
>> think your PR only considers IterableLikeCoder, there's others, and
>> there's the Elements proto itself) to prefix element encodings with
>> sizes so it can give truncated streams on decoding. I think this
>> places an undue burden (and code redundancy in) container coders, and
>> disallows optimization on those coders that don't need to be length
>> prefixed (and note that *prefixing* with length is not the only way to
>> delimit a stream, we shouldn't impose that restriction as well).
>> Instead, I'd keep thing the way they are, but offer a new Coder
>> subclass that users can subclass if they want to write an "easy" Coder
>> that does the delimiting for them (on encode and decode). We would
>> point users to this for writing custom coders in the easiest way
>> possible as a good option, and keeps the current Coder API the same.
>>
>> On Mon, Feb 5, 2018 at 10:21 AM, Romain Manni-Bucau
>> <rm...@gmail.com> wrote:
>> > Answered inlined but I want to highlight beam is a portable API on top
>> of
>> > well known vendors API which have friendly shortcuts. So the background
>> here
>> > is to make beam at least user friendly.
>> >
>> > Im fine if the outcome of the discussion is coder concept is wrong or
>> > something like that but Im not fine to say we dont want to solve an API
>> > issue, to not say bug, of a project which has an API as added value.
>> >
>> > I understand the perf concern which must be balanced with the fact
>> > derialization is not used for each step/transform and that currently the
>> > coder API is already intrusive and heavy for dev but also not usable by
>> most
>> > existing codecs out there. Even some jaxb or plain xml flavors dont work
>> > with it :(.
>> >
>> >
>> > Le 5 févr. 2018 18:46, "Robert Bradshaw" <ro...@google.com> a écrit
>> :
>> >
>> > On Sun, Feb 4, 2018 at 6:44 AM, Romain Manni-Bucau
>> > <rm...@gmail.com> wrote:
>> >> Hi guys,
>> >>
>> >> I submitted a PR on coders to enhance 1. the user experience 2. the
>> >> determinism and handling of coders.
>> >>
>> >> 1. the user experience is linked to what i sent some days ago: close
>> >> handling of the streams from a coder code. Long story short I add a
>> >> SkipCloseCoder which can decorate a coder and just wraps the stream
>> (input
>> >> or output) in flavors skipping close() calls. This avoids to do it by
>> >> default (which had my preference if you read the related thread but not
>> >> the
>> >> one of everybody) but also makes the usage of a coder with this issue
>> easy
>> >> since the of() of the coder just wraps itself in this delagating coder.
>> >>
>> >> 2. this one is more nasty and mainly concerns IterableLikeCoders. These
>> >> ones
>> >> use this kind of algorithm (keep in mind they work on a list):
>> >>
>> >> writeSize()
>> >> for all element e {
>> >>     elementCoder.write(e)
>> >> }
>> >> writeMagicNumber() // this one depends the size
>> >>
>> >> The decoding is symmetric so I bypass it here.
>> >>
>> >> Indeed all these writes (reads) are done on the same stream. Therefore
>> it
>> >> assumes you read as much bytes than you write...which is a huge
>> assumption
>> >> for a coder which should by contract assume it can read the
>> stream...as a
>> >> stream (until -1).
>> >>
>> >> The idea of the fix is to change this encoding to this kind of
>> algorithm:
>> >>
>> >> writeSize()
>> >> for all element e {
>> >>     writeElementByteCount(e)
>> >>     elementCoder.write(e)
>> >> }
>> >> writeMagicNumber() // still optionally
>> >
>> > Regardless of the backwards incompatibility issues, I'm unconvinced
>> > that prefixing every element with its length is a good idea. It can
>> > lead to blow-up in size (e.g. a list of ints, and it should be noted
>> > that containers with lots of elements bias towards having small
>> > elements) and also writeElementByteCount(e) could be very inefficient
>> > for many type e (e.g. a list of lists).
>> >
>> >
>> > What is your proposal Robert then? Current restriction is clearly a
>> blocker
>> > for portability, users, determinism and is unsafe and only checkable at
>> > runtime so not something we should lead to keep.
>> >
>> > Alternative i thought about was to forbid implicit coders but it doesnt
>> help
>> > users.
>> >
>> >
>> >
>> >> This way on the decode size you can wrap the stream by element to
>> enforce
>> >> the limitation of the byte count.
>> >>
>> >> Side note: this indeed enforce a limitation due to java byte limitation
>> >> but
>> >> if you check coder code it is already here at the higher level so it is
>> >> not
>> >> a big deal for now.
>> >>
>> >> In terms of implementation it uses a LengthAwareCoder which delegates
>> to
>> >> another coder the encoding and just adds the byte count before the
>> actual
>> >> serialization. Not perfect but should be more than enough in terms of
>> >> support and perf for beam if you think real pipelines (we try to avoid
>> >> serializations or it is done on some well known points where this algo
>> >> should be enough...worse case it is not a huge overhead, mainly just
>> some
>> >> memory overhead).
>> >>
>> >>
>> >> The PR is available at https://github.com/apache/beam/pull/4594. If
>> you
>> >> check you will see I put it "WIP". The main reason is that it changes
>> the
>> >> encoding format for containers (lists, iterable, ...) and therefore
>> breaks
>> >> python/go/... tests and the standard_coders.yml definition. Some help
>> on
>> >> that would be very welcomed.
>> >>
>> >> Technical side note if you wonder: UnownedInputStream doesn't even
>> allow
>> >> to
>> >> mark the stream so there is no real fast way to read the stream as
>> fast as
>> >> possible with standard buffering strategies and to support this
>> automatic
>> >> IterableCoder wrapping which is implicit. In other words, if beam
>> wants to
>> >> support any coder, including the ones not requiring to write the size
>> of
>> >> the
>> >> output - most of the codecs - then we need to change the way it works
>> to
>> >> something like that which does it for the user which doesn't know its
>> >> coder
>> >> got wrapped.
>> >>
>> >> Hope it makes sense, if not, don't hesitate to ask questions.
>> >>
>> >> Happy end of week-end.
>> >>
>> >> Romain Manni-Bucau
>> >> @rmannibucau |  Blog | Old Blog | Github | LinkedIn | Book
>> >
>> >
>>
>
>

Re: coder evolutions?

Posted by Kenneth Knowles <kl...@google.com>.
TL;DR: creating _new_ coders is not a problem. If you have a new idea for an
encoding, you can build it alongside and users can use it. We also need
data migration, and this is probably the easiest way to be ready for that.

We made a pretty big mistake in our naming of ListCoder, SetCoder, and
IterableLikeCoder because they make users think each is the
only/best/canonical encoding. We did it right with e.g. VarLongCoder and
BigEndianLongCoder. There is a default, but it is just a default.

We actually already need "SetIterableLikeCoder" (aka SetCoder) and perhaps
"LexicallySortedBytesSetCoder" so we can change coder inference to ask for
a deterministic coder when one is needed, instead of first asking for "any"
coder and then crashing when we get the wrong kind.
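
To illustrate, a rough sketch of what a "LexicallySortedBytesSetCoder"
could look like - no such coder exists in Beam today, this is only to show
why sorting the encoded bytes makes the result deterministic:

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import org.apache.beam.sdk.coders.Coder;
import org.apache.beam.sdk.coders.CustomCoder;

// Sketch only: encode each element to its own byte array, sort the byte
// arrays lexicographically, then write count + length-prefixed entries.
// Iteration order of the Set no longer influences the output, so two equal
// sets always encode to the same bytes.
class LexicallySortedBytesSetCoder<T> extends CustomCoder<Set<T>> {
  private final Coder<T> elementCoder;

  LexicallySortedBytesSetCoder(Coder<T> elementCoder) {
    this.elementCoder = elementCoder;
  }

  @Override
  public void encode(Set<T> value, OutputStream outStream) throws IOException {
    List<byte[]> encoded = new ArrayList<>();
    for (T element : value) {
      ByteArrayOutputStream buffer = new ByteArrayOutputStream();
      elementCoder.encode(element, buffer);
      encoded.add(buffer.toByteArray());
    }
    encoded.sort((a, b) -> {                 // unsigned lexicographic order
      for (int i = 0; i < Math.min(a.length, b.length); i++) {
        int cmp = Integer.compare(a[i] & 0xff, b[i] & 0xff);
        if (cmp != 0) {
          return cmp;
        }
      }
      return Integer.compare(a.length, b.length);
    });
    DataOutputStream data = new DataOutputStream(outStream);
    data.writeInt(encoded.size());
    for (byte[] bytes : encoded) {
      data.writeInt(bytes.length);
      data.write(bytes);
    }
    data.flush();
  }

  @Override
  public Set<T> decode(InputStream inStream) throws IOException {
    DataInputStream data = new DataInputStream(inStream);
    int size = data.readInt();
    Set<T> result = new LinkedHashSet<>();
    for (int i = 0; i < size; i++) {
      byte[] bytes = new byte[data.readInt()];
      data.readFully(bytes);
      result.add(elementCoder.decode(new ByteArrayInputStream(bytes)));
    }
    return result;
  }

  @Override
  public void verifyDeterministic() throws Coder.NonDeterministicException {
    // Order is fixed by sorting, so this coder is deterministic whenever the
    // element coder is.
    elementCoder.verifyDeterministic();
  }
}

The price is one extra buffering pass per element, but that is the usual
cost of a canonical encoding.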

Kenn

On Mon, Feb 5, 2018 at 11:00 AM, Robert Bradshaw <ro...@google.com>
wrote:

> Just to clarify, the issue is that for some types (byte array being
> the simplest) one needs to know the length of the data in order to
> decode it from the stream. In particular, the claim is that many
> libraries out there that do encoding/decoding assume they can gather
> this information from the end of the stream and so don't explicitly
> record it. For nested values, someone needs to record these lengths.
> Note that in the Fn API, nearly everything is nested, as the elements
> are sent as a large byte stream of concatenated encoded elements.
>
> Your proposed solution is to require all container coders (though I
> think your PR only considers IterableLikeCoder, there's others, and
> there's the Elements proto itself) to prefix element encodings with
> sizes so it can give truncated streams on decoding. I think this
> places an undue burden (and code redundancy in) container coders, and
> disallows optimization on those coders that don't need to be length
> prefixed (and note that *prefixing* with length is not the only way to
> delimit a stream, we shouldn't impose that restriction as well).
> Instead, I'd keep thing the way they are, but offer a new Coder
> subclass that users can subclass if they want to write an "easy" Coder
> that does the delimiting for them (on encode and decode). We would
> point users to this for writing custom coders in the easiest way
> possible as a good option, and keeps the current Coder API the same.
>
> On Mon, Feb 5, 2018 at 10:21 AM, Romain Manni-Bucau
> <rm...@gmail.com> wrote:
> > Answered inlined but I want to highlight beam is a portable API on top of
> > well known vendors API which have friendly shortcuts. So the background
> here
> > is to make beam at least user friendly.
> >
> > Im fine if the outcome of the discussion is coder concept is wrong or
> > something like that but Im not fine to say we dont want to solve an API
> > issue, to not say bug, of a project which has an API as added value.
> >
> > I understand the perf concern which must be balanced with the fact
> > derialization is not used for each step/transform and that currently the
> > coder API is already intrusive and heavy for dev but also not usable by
> most
> > existing codecs out there. Even some jaxb or plain xml flavors dont work
> > with it :(.
> >
> >
> > Le 5 févr. 2018 18:46, "Robert Bradshaw" <ro...@google.com> a écrit :
> >
> > On Sun, Feb 4, 2018 at 6:44 AM, Romain Manni-Bucau
> > <rm...@gmail.com> wrote:
> >> Hi guys,
> >>
> >> I submitted a PR on coders to enhance 1. the user experience 2. the
> >> determinism and handling of coders.
> >>
> >> 1. the user experience is linked to what i sent some days ago: close
> >> handling of the streams from a coder code. Long story short I add a
> >> SkipCloseCoder which can decorate a coder and just wraps the stream
> (input
> >> or output) in flavors skipping close() calls. This avoids to do it by
> >> default (which had my preference if you read the related thread but not
> >> the
> >> one of everybody) but also makes the usage of a coder with this issue
> easy
> >> since the of() of the coder just wraps itself in this delagating coder.
> >>
> >> 2. this one is more nasty and mainly concerns IterableLikeCoders. These
> >> ones
> >> use this kind of algorithm (keep in mind they work on a list):
> >>
> >> writeSize()
> >> for all element e {
> >>     elementCoder.write(e)
> >> }
> >> writeMagicNumber() // this one depends the size
> >>
> >> The decoding is symmetric so I bypass it here.
> >>
> >> Indeed all these writes (reads) are done on the same stream. Therefore
> it
> >> assumes you read as much bytes than you write...which is a huge
> assumption
> >> for a coder which should by contract assume it can read the stream...as
> a
> >> stream (until -1).
> >>
> >> The idea of the fix is to change this encoding to this kind of
> algorithm:
> >>
> >> writeSize()
> >> for all element e {
> >>     writeElementByteCount(e)
> >>     elementCoder.write(e)
> >> }
> >> writeMagicNumber() // still optionally
> >
> > Regardless of the backwards incompatibility issues, I'm unconvinced
> > that prefixing every element with its length is a good idea. It can
> > lead to blow-up in size (e.g. a list of ints, and it should be noted
> > that containers with lots of elements bias towards having small
> > elements) and also writeElementByteCount(e) could be very inefficient
> > for many type e (e.g. a list of lists).
> >
> >
> > What is your proposal Robert then? Current restriction is clearly a
> blocker
> > for portability, users, determinism and is unsafe and only checkable at
> > runtime so not something we should lead to keep.
> >
> > Alternative i thought about was to forbid implicit coders but it doesnt
> help
> > users.
> >
> >
> >
> >> This way on the decode size you can wrap the stream by element to
> enforce
> >> the limitation of the byte count.
> >>
> >> Side note: this indeed enforce a limitation due to java byte limitation
> >> but
> >> if you check coder code it is already here at the higher level so it is
> >> not
> >> a big deal for now.
> >>
> >> In terms of implementation it uses a LengthAwareCoder which delegates to
> >> another coder the encoding and just adds the byte count before the
> actual
> >> serialization. Not perfect but should be more than enough in terms of
> >> support and perf for beam if you think real pipelines (we try to avoid
> >> serializations or it is done on some well known points where this algo
> >> should be enough...worse case it is not a huge overhead, mainly just
> some
> >> memory overhead).
> >>
> >>
> >> The PR is available at https://github.com/apache/beam/pull/4594. If you
> >> check you will see I put it "WIP". The main reason is that it changes
> the
> >> encoding format for containers (lists, iterable, ...) and therefore
> breaks
> >> python/go/... tests and the standard_coders.yml definition. Some help on
> >> that would be very welcomed.
> >>
> >> Technical side note if you wonder: UnownedInputStream doesn't even allow
> >> to
> >> mark the stream so there is no real fast way to read the stream as fast
> as
> >> possible with standard buffering strategies and to support this
> automatic
> >> IterableCoder wrapping which is implicit. In other words, if beam wants
> to
> >> support any coder, including the ones not requiring to write the size of
> >> the
> >> output - most of the codecs - then we need to change the way it works to
> >> something like that which does it for the user which doesn't know its
> >> coder
> >> got wrapped.
> >>
> >> Hope it makes sense, if not, don't hesitate to ask questions.
> >>
> >> Happy end of week-end.
> >>
> >> Romain Manni-Bucau
> >> @rmannibucau |  Blog | Old Blog | Github | LinkedIn | Book
> >
> >
>

Re: coder evolutions?

Posted by Eugene Kirpichov <ki...@google.com>.
From a brief reading of this discussion: if I understand correctly, we want
something to help deal with libraries that assume that they own the stream
(e.g. some common xml or json parsers), when using them in a context where
they don't (inside a Coder).

Setting aside the question of "why would one even use an xml or json
library in a coder" (coders should be efficient, and the wire format of a
coder is not intended to be readable by anything except this exact coder
itself), and the fact that ideally users would simply never write new
coders at all (I'm hoping schemas can move us in that direction) - I think
we just want an adapter for input and output streams that does this, and to
put this adapter in front of both reading and writing when using the
library.

One way to build such an adapter is length prefixing:
- the output stream collects all bytes into a byte array and, when closed,
writes the array length-prefixed
- the input stream reads the length and then lets the consumer read only
that many bytes
This adds an extra copy (extra GC pressure + extra maximum memory usage). I
suppose there has to be some price for integrating with a misbehaving
library, but I, like some others in this thread, would not be comfortable
having this overhead unconditionally, even when it's not needed.
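
A rough sketch of such an adapter, with made-up names (nothing like this
exists in Beam; the bounded read view here uses Guava's ByteStreams.limit):

import com.google.common.io.ByteStreams;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Hypothetical names only.
final class LengthPrefix {

  // Buffers everything the library writes and, on close(), emits
  // length + payload to the real stream (this is the extra copy).
  // A real implementation would guard against close() being called twice.
  static final class Output extends ByteArrayOutputStream {
    private final DataOutputStream target;

    Output(OutputStream target) {
      this.target = new DataOutputStream(target);
    }

    @Override
    public void close() throws IOException {
      target.writeInt(count);       // length prefix
      target.write(buf, 0, count);  // buffered payload
      target.flush();               // but never close the underlying stream
    }
  }

  // Reads the length prefix and returns a stream that reports EOF after
  // exactly that many bytes, so a library that "reads until -1" cannot
  // run into the next element's bytes.
  static InputStream input(InputStream source) throws IOException {
    int length = new DataInputStream(source).readInt();
    return ByteStreams.limit(source, length);  // Guava's bounded view
  }
}

The "easy" Coder base class Robert suggests below could simply do this
wrapping inside encode()/decode() so coder authors never see it.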

There are other ways to nest streams with considerably less overhead: e.g.
chunked length-prefixing (still copying, but less), or escaping (reserving an
"escape" byte, e.g. 0xFF, and writing 0xFF 0xFF whenever the data contains
that byte). However, I think we still don't need to apply them
unconditionally - I see nothing wrong with applying them on a case-by-case
basis, only when your coder is actually using a misbehaving library.
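
And, purely to illustrate the escaping variant (again only a sketch; 0xFF is
doubled inside the payload, and 0xFF 0x00 is a made-up end-of-element marker):

import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

class EscapingOutputStream extends FilterOutputStream {
  EscapingOutputStream(OutputStream out) { super(out); }

  @Override
  public void write(int b) throws IOException {
    if ((b & 0xFF) == 0xFF) {
      out.write(0xFF);  // double the escape byte
    }
    out.write(b);
  }

  @Override
  public void close() throws IOException {
    out.write(0xFF);    // end marker: escape byte followed by 0x00
    out.write(0x00);
    out.flush();        // leave the underlying stream open
  }
}

class UnescapingInputStream extends InputStream {
  private final InputStream in;
  private boolean done;

  UnescapingInputStream(InputStream in) { this.in = in; }

  @Override
  public int read() throws IOException {
    if (done) { return -1; }
    int b = in.read();
    if (b != 0xFF) { return b; }        // ordinary byte (or real end of stream)
    int next = in.read();
    if (next == 0xFF) { return 0xFF; }  // escaped 0xFF
    done = true;                        // 0xFF 0x00: end of this element
    return -1;
  }
}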

On Mon, Feb 5, 2018 at 11:00 AM Robert Bradshaw <ro...@google.com> wrote:

> Just to clarify, the issue is that for some types (byte array being
> the simplest) one needs to know the length of the data in order to
> decode it from the stream. In particular, the claim is that many
> libraries out there that do encoding/decoding assume they can gather
> this information from the end of the stream and so don't explicitly
> record it. For nested values, someone needs to record these lengths.
> Note that in the Fn API, nearly everything is nested, as the elements
> are sent as a large byte stream of concatenated encoded elements.
>
> Your proposed solution is to require all container coders (though I
> think your PR only considers IterableLikeCoder, there's others, and
> there's the Elements proto itself) to prefix element encodings with
> sizes so it can give truncated streams on decoding. I think this
> places an undue burden (and code redundancy in) container coders, and
> disallows optimization on those coders that don't need to be length
> prefixed (and note that *prefixing* with length is not the only way to
> delimit a stream, we shouldn't impose that restriction as well).
> Instead, I'd keep thing the way they are, but offer a new Coder
> subclass that users can subclass if they want to write an "easy" Coder
> that does the delimiting for them (on encode and decode). We would
> point users to this for writing custom coders in the easiest way
> possible as a good option, and keeps the current Coder API the same.
>
> On Mon, Feb 5, 2018 at 10:21 AM, Romain Manni-Bucau
> <rm...@gmail.com> wrote:
> > Answered inlined but I want to highlight beam is a portable API on top of
> > well known vendors API which have friendly shortcuts. So the background
> here
> > is to make beam at least user friendly.
> >
> > Im fine if the outcome of the discussion is coder concept is wrong or
> > something like that but Im not fine to say we dont want to solve an API
> > issue, to not say bug, of a project which has an API as added value.
> >
> > I understand the perf concern which must be balanced with the fact
> > derialization is not used for each step/transform and that currently the
> > coder API is already intrusive and heavy for dev but also not usable by
> most
> > existing codecs out there. Even some jaxb or plain xml flavors dont work
> > with it :(.
> >
> >
> > Le 5 févr. 2018 18:46, "Robert Bradshaw" <ro...@google.com> a écrit :
> >
> > On Sun, Feb 4, 2018 at 6:44 AM, Romain Manni-Bucau
> > <rm...@gmail.com> wrote:
> >> Hi guys,
> >>
> >> I submitted a PR on coders to enhance 1. the user experience 2. the
> >> determinism and handling of coders.
> >>
> >> 1. the user experience is linked to what i sent some days ago: close
> >> handling of the streams from a coder code. Long story short I add a
> >> SkipCloseCoder which can decorate a coder and just wraps the stream
> (input
> >> or output) in flavors skipping close() calls. This avoids to do it by
> >> default (which had my preference if you read the related thread but not
> >> the
> >> one of everybody) but also makes the usage of a coder with this issue
> easy
> >> since the of() of the coder just wraps itself in this delagating coder.
> >>
> >> 2. this one is more nasty and mainly concerns IterableLikeCoders. These
> >> ones
> >> use this kind of algorithm (keep in mind they work on a list):
> >>
> >> writeSize()
> >> for all element e {
> >>     elementCoder.write(e)
> >> }
> >> writeMagicNumber() // this one depends the size
> >>
> >> The decoding is symmetric so I bypass it here.
> >>
> >> Indeed all these writes (reads) are done on the same stream. Therefore
> it
> >> assumes you read as much bytes than you write...which is a huge
> assumption
> >> for a coder which should by contract assume it can read the stream...as
> a
> >> stream (until -1).
> >>
> >> The idea of the fix is to change this encoding to this kind of
> algorithm:
> >>
> >> writeSize()
> >> for all element e {
> >>     writeElementByteCount(e)
> >>     elementCoder.write(e)
> >> }
> >> writeMagicNumber() // still optionally
> >
> > Regardless of the backwards incompatibility issues, I'm unconvinced
> > that prefixing every element with its length is a good idea. It can
> > lead to blow-up in size (e.g. a list of ints, and it should be noted
> > that containers with lots of elements bias towards having small
> > elements) and also writeElementByteCount(e) could be very inefficient
> > for many type e (e.g. a list of lists).
> >
> >
> > What is your proposal Robert then? Current restriction is clearly a
> blocker
> > for portability, users, determinism and is unsafe and only checkable at
> > runtime so not something we should lead to keep.
> >
> > Alternative i thought about was to forbid implicit coders but it doesnt
> help
> > users.
> >
> >
> >
> >> This way on the decode size you can wrap the stream by element to
> enforce
> >> the limitation of the byte count.
> >>
> >> Side note: this indeed enforce a limitation due to java byte limitation
> >> but
> >> if you check coder code it is already here at the higher level so it is
> >> not
> >> a big deal for now.
> >>
> >> In terms of implementation it uses a LengthAwareCoder which delegates to
> >> another coder the encoding and just adds the byte count before the
> actual
> >> serialization. Not perfect but should be more than enough in terms of
> >> support and perf for beam if you think real pipelines (we try to avoid
> >> serializations or it is done on some well known points where this algo
> >> should be enough...worse case it is not a huge overhead, mainly just
> some
> >> memory overhead).
> >>
> >>
> >> The PR is available at https://github.com/apache/beam/pull/4594. If you
> >> check you will see I put it "WIP". The main reason is that it changes
> the
> >> encoding format for containers (lists, iterable, ...) and therefore
> breaks
> >> python/go/... tests and the standard_coders.yml definition. Some help on
> >> that would be very welcomed.
> >>
> >> Technical side note if you wonder: UnownedInputStream doesn't even allow
> >> to
> >> mark the stream so there is no real fast way to read the stream as fast
> as
> >> possible with standard buffering strategies and to support this
> automatic
> >> IterableCoder wrapping which is implicit. In other words, if beam wants
> to
> >> support any coder, including the ones not requiring to write the size of
> >> the
> >> output - most of the codecs - then we need to change the way it works to
> >> something like that which does it for the user which doesn't know its
> >> coder
> >> got wrapped.
> >>
> >> Hope it makes sense, if not, don't hesitate to ask questions.
> >>
> >> Happy end of week-end.
> >>
> >> Romain Manni-Bucau
> >> @rmannibucau |  Blog | Old Blog | Github | LinkedIn | Book
> >
> >
>

Re: coder evolutions?

Posted by Robert Bradshaw <ro...@google.com>.
Just to clarify, the issue is that for some types (byte array being
the simplest) one needs to know the length of the data in order to
decode it from the stream. In particular, the claim is that many
libraries out there that do encoding/decoding assume they can gather
this information from the end of the stream and so don't explicitly
record it. For nested values, someone needs to record these lengths.
Note that in the Fn API, nearly everything is nested, as the elements
are sent as a large byte stream of concatenated encoded elements.

Your proposed solution is to require all container coders (though I
think your PR only considers IterableLikeCoder, there are others, and
there's the Elements proto itself) to prefix element encodings with
sizes so that they can give truncated streams on decoding. I think this
places an undue burden on (and code redundancy in) container coders, and
disallows optimization on those coders that don't need to be length
prefixed (and note that *prefixing* with length is not the only way to
delimit a stream, so we shouldn't impose that restriction either).
Instead, I'd keep things the way they are, but offer a new Coder
subclass that users can extend if they want to write an "easy" Coder
that does the delimiting for them (on encode and decode). We would
point users to this as a good option for writing custom coders in the
easiest way possible, and it keeps the current Coder API the same.
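
As a rough sketch of the kind of convenience base class I mean (DelimitedCoder
is a made-up name, not an existing Beam class, and a fixed 4-byte length
prefix is used only to keep the example short):

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import org.apache.beam.sdk.coders.CustomCoder;

// Subclasses get streams they fully own (read to -1, close, etc.); the
// delimiting is done here by length-prefixing the element's encoded bytes.
abstract class DelimitedCoder<T> extends CustomCoder<T> {

  protected abstract void encodeOwned(T value, OutputStream out) throws IOException;

  protected abstract T decodeOwned(InputStream in) throws IOException;

  @Override
  public void encode(T value, OutputStream outStream) throws IOException {
    ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    encodeOwned(value, buffer);    // the subclass owns 'buffer' entirely
    byte[] bytes = buffer.toByteArray();
    DataOutputStream data = new DataOutputStream(outStream);
    data.writeInt(bytes.length);   // delimit with a length prefix
    data.write(bytes);
    data.flush();
  }

  @Override
  public T decode(InputStream inStream) throws IOException {
    DataInputStream data = new DataInputStream(inStream);
    byte[] bytes = new byte[data.readInt()];
    data.readFully(bytes);
    return decodeOwned(new ByteArrayInputStream(bytes));  // exactly one element
  }
}

Only coders written this way would pay the buffering cost; container coders
and the existing built-in coders stay exactly as they are.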

On Mon, Feb 5, 2018 at 10:21 AM, Romain Manni-Bucau
<rm...@gmail.com> wrote:
> Answered inlined but I want to highlight beam is a portable API on top of
> well known vendors API which have friendly shortcuts. So the background here
> is to make beam at least user friendly.
>
> Im fine if the outcome of the discussion is coder concept is wrong or
> something like that but Im not fine to say we dont want to solve an API
> issue, to not say bug, of a project which has an API as added value.
>
> I understand the perf concern which must be balanced with the fact
> derialization is not used for each step/transform and that currently the
> coder API is already intrusive and heavy for dev but also not usable by most
> existing codecs out there. Even some jaxb or plain xml flavors dont work
> with it :(.
>
>
> Le 5 févr. 2018 18:46, "Robert Bradshaw" <ro...@google.com> a écrit :
>
> On Sun, Feb 4, 2018 at 6:44 AM, Romain Manni-Bucau
> <rm...@gmail.com> wrote:
>> Hi guys,
>>
>> I submitted a PR on coders to enhance 1. the user experience 2. the
>> determinism and handling of coders.
>>
>> 1. the user experience is linked to what i sent some days ago: close
>> handling of the streams from a coder code. Long story short I add a
>> SkipCloseCoder which can decorate a coder and just wraps the stream (input
>> or output) in flavors skipping close() calls. This avoids to do it by
>> default (which had my preference if you read the related thread but not
>> the
>> one of everybody) but also makes the usage of a coder with this issue easy
>> since the of() of the coder just wraps itself in this delagating coder.
>>
>> 2. this one is more nasty and mainly concerns IterableLikeCoders. These
>> ones
>> use this kind of algorithm (keep in mind they work on a list):
>>
>> writeSize()
>> for all element e {
>>     elementCoder.write(e)
>> }
>> writeMagicNumber() // this one depends the size
>>
>> The decoding is symmetric so I bypass it here.
>>
>> Indeed all these writes (reads) are done on the same stream. Therefore it
>> assumes you read as much bytes than you write...which is a huge assumption
>> for a coder which should by contract assume it can read the stream...as a
>> stream (until -1).
>>
>> The idea of the fix is to change this encoding to this kind of algorithm:
>>
>> writeSize()
>> for all element e {
>>     writeElementByteCount(e)
>>     elementCoder.write(e)
>> }
>> writeMagicNumber() // still optionally
>
> Regardless of the backwards incompatibility issues, I'm unconvinced
> that prefixing every element with its length is a good idea. It can
> lead to blow-up in size (e.g. a list of ints, and it should be noted
> that containers with lots of elements bias towards having small
> elements) and also writeElementByteCount(e) could be very inefficient
> for many type e (e.g. a list of lists).
>
>
> What is your proposal Robert then? Current restriction is clearly a blocker
> for portability, users, determinism and is unsafe and only checkable at
> runtime so not something we should lead to keep.
>
> Alternative i thought about was to forbid implicit coders but it doesnt help
> users.
>
>
>
>> This way on the decode size you can wrap the stream by element to enforce
>> the limitation of the byte count.
>>
>> Side note: this indeed enforce a limitation due to java byte limitation
>> but
>> if you check coder code it is already here at the higher level so it is
>> not
>> a big deal for now.
>>
>> In terms of implementation it uses a LengthAwareCoder which delegates to
>> another coder the encoding and just adds the byte count before the actual
>> serialization. Not perfect but should be more than enough in terms of
>> support and perf for beam if you think real pipelines (we try to avoid
>> serializations or it is done on some well known points where this algo
>> should be enough...worse case it is not a huge overhead, mainly just some
>> memory overhead).
>>
>>
>> The PR is available at https://github.com/apache/beam/pull/4594. If you
>> check you will see I put it "WIP". The main reason is that it changes the
>> encoding format for containers (lists, iterable, ...) and therefore breaks
>> python/go/... tests and the standard_coders.yml definition. Some help on
>> that would be very welcomed.
>>
>> Technical side note if you wonder: UnownedInputStream doesn't even allow
>> to
>> mark the stream so there is no real fast way to read the stream as fast as
>> possible with standard buffering strategies and to support this automatic
>> IterableCoder wrapping which is implicit. In other words, if beam wants to
>> support any coder, including the ones not requiring to write the size of
>> the
>> output - most of the codecs - then we need to change the way it works to
>> something like that which does it for the user which doesn't know its
>> coder
>> got wrapped.
>>
>> Hope it makes sense, if not, don't hesitate to ask questions.
>>
>> Happy end of week-end.
>>
>> Romain Manni-Bucau
>> @rmannibucau |  Blog | Old Blog | Github | LinkedIn | Book
>
>

Re: coder evolutions?

Posted by Raghu Angadi <ra...@google.com>.
Could you describe the 2nd issue in a bit more detail, maybe with a short
example? LengthAwareCoder in the PR adds another buffer copy
(BufferedElementCountingOutputStream already has an extra buffer copy).

On Mon, Feb 5, 2018 at 10:34 AM, Romain Manni-Bucau <rm...@gmail.com>
wrote:

> Would this work for everyone - can update the pr if so:
>
> If coder is not built in
>     Prefix with byte size
> Else
>     Current behavior
>
> ?
>
> Le 5 févr. 2018 19:21, "Romain Manni-Bucau" <rm...@gmail.com> a
> écrit :
>
>> Answered inlined but I want to highlight beam is a portable API on top of
>> well known vendors API which have friendly shortcuts. So the background
>> here is to make beam at least user friendly.
>>
>> Im fine if the outcome of the discussion is coder concept is wrong or
>> something like that but Im not fine to say we dont want to solve an API
>> issue, to not say bug, of a project which has an API as added value.
>>
>> I understand the perf concern which must be balanced with the fact
>> derialization is not used for each step/transform and that currently the
>> coder API is already intrusive and heavy for dev but also not usable by
>> most existing codecs out there. Even some jaxb or plain xml flavors dont
>> work with it :(.
>>
>> Le 5 févr. 2018 18:46, "Robert Bradshaw" <ro...@google.com> a écrit :
>>
>> On Sun, Feb 4, 2018 at 6:44 AM, Romain Manni-Bucau
>> <rm...@gmail.com> wrote:
>> > Hi guys,
>> >
>> > I submitted a PR on coders to enhance 1. the user experience 2. the
>> > determinism and handling of coders.
>> >
>> > 1. the user experience is linked to what i sent some days ago: close
>> > handling of the streams from a coder code. Long story short I add a
>> > SkipCloseCoder which can decorate a coder and just wraps the stream
>> (input
>> > or output) in flavors skipping close() calls. This avoids to do it by
>> > default (which had my preference if you read the related thread but not
>> the
>> > one of everybody) but also makes the usage of a coder with this issue
>> easy
>> > since the of() of the coder just wraps itself in this delagating coder.
>> >
>> > 2. this one is more nasty and mainly concerns IterableLikeCoders. These
>> ones
>> > use this kind of algorithm (keep in mind they work on a list):
>> >
>> > writeSize()
>> > for all element e {
>> >     elementCoder.write(e)
>> > }
>> > writeMagicNumber() // this one depends the size
>> >
>> > The decoding is symmetric so I bypass it here.
>> >
>> > Indeed all these writes (reads) are done on the same stream. Therefore
>> it
>> > assumes you read as much bytes than you write...which is a huge
>> assumption
>> > for a coder which should by contract assume it can read the stream...as
>> a
>> > stream (until -1).
>> >
>> > The idea of the fix is to change this encoding to this kind of
>> algorithm:
>> >
>> > writeSize()
>> > for all element e {
>> >     writeElementByteCount(e)
>> >     elementCoder.write(e)
>> > }
>> > writeMagicNumber() // still optionally
>>
>> Regardless of the backwards incompatibility issues, I'm unconvinced
>> that prefixing every element with its length is a good idea. It can
>> lead to blow-up in size (e.g. a list of ints, and it should be noted
>> that containers with lots of elements bias towards having small
>> elements) and also writeElementByteCount(e) could be very inefficient
>> for many type e (e.g. a list of lists).
>>
>>
>> What is your proposal Robert then? Current restriction is clearly a
>> blocker for portability, users, determinism and is unsafe and only
>> checkable at runtime so not something we should lead to keep.
>>
>> Alternative i thought about was to forbid implicit coders but it doesnt
>> help users.
>>
>>
>>
>> > This way on the decode size you can wrap the stream by element to
>> enforce
>> > the limitation of the byte count.
>> >
>> > Side note: this indeed enforce a limitation due to java byte limitation
>> but
>> > if you check coder code it is already here at the higher level so it is
>> not
>> > a big deal for now.
>> >
>> > In terms of implementation it uses a LengthAwareCoder which delegates to
>> > another coder the encoding and just adds the byte count before the
>> actual
>> > serialization. Not perfect but should be more than enough in terms of
>> > support and perf for beam if you think real pipelines (we try to avoid
>> > serializations or it is done on some well known points where this algo
>> > should be enough...worse case it is not a huge overhead, mainly just
>> some
>> > memory overhead).
>> >
>> >
>> > The PR is available at https://github.com/apache/beam/pull/4594. If you
>> > check you will see I put it "WIP". The main reason is that it changes
>> the
>> > encoding format for containers (lists, iterable, ...) and therefore
>> breaks
>> > python/go/... tests and the standard_coders.yml definition. Some help on
>> > that would be very welcomed.
>> >
>> > Technical side note if you wonder: UnownedInputStream doesn't even
>> allow to
>> > mark the stream so there is no real fast way to read the stream as fast
>> as
>> > possible with standard buffering strategies and to support this
>> automatic
>> > IterableCoder wrapping which is implicit. In other words, if beam wants
>> to
>> > support any coder, including the ones not requiring to write the size
>> of the
>> > output - most of the codecs - then we need to change the way it works to
>> > something like that which does it for the user which doesn't know its
>> coder
>> > got wrapped.
>> >
>> > Hope it makes sense, if not, don't hesitate to ask questions.
>> >
>> > Happy end of week-end.
>> >
>> > Romain Manni-Bucau
>> > @rmannibucau |  Blog | Old Blog | Github | LinkedIn | Book
>>
>>
>>
>

Re: coder evolutions?

Posted by Romain Manni-Bucau <rm...@gmail.com>.
Would this work for everyone? I can update the PR if so:

If coder is not built in
    Prefix with byte size
Else
    Current behavior

?
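
In code the rule would be roughly the following (sketch only; the "built in"
test here is just an illustrative package-name check, and it assumes the PR's
LengthAwareCoder exposes an of() factory):

import org.apache.beam.sdk.coders.Coder;

class CoderWrapping {
  static <T> Coder<T> wrapElementCoder(Coder<T> elementCoder) {
    boolean builtIn =
        elementCoder.getClass().getName().startsWith("org.apache.beam.sdk.coders.");
    return builtIn
        ? elementCoder                        // current behavior, no extra bytes on the wire
        : LengthAwareCoder.of(elementCoder);  // prefix each element with its byte size
  }
}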

Le 5 févr. 2018 19:21, "Romain Manni-Bucau" <rm...@gmail.com> a
écrit :

> Answered inlined but I want to highlight beam is a portable API on top of
> well known vendors API which have friendly shortcuts. So the background
> here is to make beam at least user friendly.
>
> Im fine if the outcome of the discussion is coder concept is wrong or
> something like that but Im not fine to say we dont want to solve an API
> issue, to not say bug, of a project which has an API as added value.
>
> I understand the perf concern which must be balanced with the fact
> derialization is not used for each step/transform and that currently the
> coder API is already intrusive and heavy for dev but also not usable by
> most existing codecs out there. Even some jaxb or plain xml flavors dont
> work with it :(.
>
> Le 5 févr. 2018 18:46, "Robert Bradshaw" <ro...@google.com> a écrit :
>
> On Sun, Feb 4, 2018 at 6:44 AM, Romain Manni-Bucau
> <rm...@gmail.com> wrote:
> > Hi guys,
> >
> > I submitted a PR on coders to enhance 1. the user experience 2. the
> > determinism and handling of coders.
> >
> > 1. the user experience is linked to what i sent some days ago: close
> > handling of the streams from a coder code. Long story short I add a
> > SkipCloseCoder which can decorate a coder and just wraps the stream
> (input
> > or output) in flavors skipping close() calls. This avoids to do it by
> > default (which had my preference if you read the related thread but not
> the
> > one of everybody) but also makes the usage of a coder with this issue
> easy
> > since the of() of the coder just wraps itself in this delagating coder.
> >
> > 2. this one is more nasty and mainly concerns IterableLikeCoders. These
> ones
> > use this kind of algorithm (keep in mind they work on a list):
> >
> > writeSize()
> > for all element e {
> >     elementCoder.write(e)
> > }
> > writeMagicNumber() // this one depends the size
> >
> > The decoding is symmetric so I bypass it here.
> >
> > Indeed all these writes (reads) are done on the same stream. Therefore it
> > assumes you read as much bytes than you write...which is a huge
> assumption
> > for a coder which should by contract assume it can read the stream...as a
> > stream (until -1).
> >
> > The idea of the fix is to change this encoding to this kind of algorithm:
> >
> > writeSize()
> > for all element e {
> >     writeElementByteCount(e)
> >     elementCoder.write(e)
> > }
> > writeMagicNumber() // still optionally
>
> Regardless of the backwards incompatibility issues, I'm unconvinced
> that prefixing every element with its length is a good idea. It can
> lead to blow-up in size (e.g. a list of ints, and it should be noted
> that containers with lots of elements bias towards having small
> elements) and also writeElementByteCount(e) could be very inefficient
> for many type e (e.g. a list of lists).
>
>
> What is your proposal Robert then? Current restriction is clearly a
> blocker for portability, users, determinism and is unsafe and only
> checkable at runtime so not something we should lead to keep.
>
> Alternative i thought about was to forbid implicit coders but it doesnt
> help users.
>
>
>
> > This way on the decode size you can wrap the stream by element to enforce
> > the limitation of the byte count.
> >
> > Side note: this indeed enforce a limitation due to java byte limitation
> but
> > if you check coder code it is already here at the higher level so it is
> not
> > a big deal for now.
> >
> > In terms of implementation it uses a LengthAwareCoder which delegates to
> > another coder the encoding and just adds the byte count before the actual
> > serialization. Not perfect but should be more than enough in terms of
> > support and perf for beam if you think real pipelines (we try to avoid
> > serializations or it is done on some well known points where this algo
> > should be enough...worse case it is not a huge overhead, mainly just some
> > memory overhead).
> >
> >
> > The PR is available at https://github.com/apache/beam/pull/4594. If you
> > check you will see I put it "WIP". The main reason is that it changes the
> > encoding format for containers (lists, iterable, ...) and therefore
> breaks
> > python/go/... tests and the standard_coders.yml definition. Some help on
> > that would be very welcomed.
> >
> > Technical side note if you wonder: UnownedInputStream doesn't even allow
> to
> > mark the stream so there is no real fast way to read the stream as fast
> as
> > possible with standard buffering strategies and to support this automatic
> > IterableCoder wrapping which is implicit. In other words, if beam wants
> to
> > support any coder, including the ones not requiring to write the size of
> the
> > output - most of the codecs - then we need to change the way it works to
> > something like that which does it for the user which doesn't know its
> coder
> > got wrapped.
> >
> > Hope it makes sense, if not, don't hesitate to ask questions.
> >
> > Happy end of week-end.
> >
> > Romain Manni-Bucau
> > @rmannibucau |  Blog | Old Blog | Github | LinkedIn | Book
>
>
>

Re: coder evolutions?

Posted by Romain Manni-Bucau <rm...@gmail.com>.
Answered inline, but I want to highlight that Beam is a portable API on top
of well-known vendor APIs which have friendly shortcuts. So the background
here is to make Beam at least user friendly.

I'm fine if the outcome of the discussion is that the coder concept is wrong
or something like that, but I'm not fine with saying we don't want to solve
an API issue, not to say a bug, in a project whose API is its added value.

I understand the perf concern, which must be balanced against the fact that
deserialization is not used for each step/transform, and that currently the
coder API is already intrusive and heavy for devs but also not usable by
most existing codecs out there. Even some JAXB or plain XML flavors don't
work with it :(.

Le 5 févr. 2018 18:46, "Robert Bradshaw" <ro...@google.com> a écrit :

On Sun, Feb 4, 2018 at 6:44 AM, Romain Manni-Bucau
<rm...@gmail.com> wrote:
> Hi guys,
>
> I submitted a PR on coders to enhance 1. the user experience 2. the
> determinism and handling of coders.
>
> 1. the user experience is linked to what i sent some days ago: close
> handling of the streams from a coder code. Long story short I add a
> SkipCloseCoder which can decorate a coder and just wraps the stream (input
> or output) in flavors skipping close() calls. This avoids to do it by
> default (which had my preference if you read the related thread but not
the
> one of everybody) but also makes the usage of a coder with this issue easy
> since the of() of the coder just wraps itself in this delagating coder.
>
> 2. this one is more nasty and mainly concerns IterableLikeCoders. These
ones
> use this kind of algorithm (keep in mind they work on a list):
>
> writeSize()
> for all element e {
>     elementCoder.write(e)
> }
> writeMagicNumber() // this one depends the size
>
> The decoding is symmetric so I bypass it here.
>
> Indeed all these writes (reads) are done on the same stream. Therefore it
> assumes you read as much bytes than you write...which is a huge assumption
> for a coder which should by contract assume it can read the stream...as a
> stream (until -1).
>
> The idea of the fix is to change this encoding to this kind of algorithm:
>
> writeSize()
> for all element e {
>     writeElementByteCount(e)
>     elementCoder.write(e)
> }
> writeMagicNumber() // still optionally

Regardless of the backwards incompatibility issues, I'm unconvinced
that prefixing every element with its length is a good idea. It can
lead to blow-up in size (e.g. a list of ints, and it should be noted
that containers with lots of elements bias towards having small
elements) and also writeElementByteCount(e) could be very inefficient
for many type e (e.g. a list of lists).


What is your proposal then, Robert? The current restriction is clearly a
blocker for portability, users, and determinism, and it is unsafe and only
checkable at runtime, so it is not something we should be keen to keep.

The alternative I thought about was to forbid implicit coders, but it
doesn't help users.



> This way on the decode size you can wrap the stream by element to enforce
> the limitation of the byte count.
>
> Side note: this indeed enforce a limitation due to java byte limitation
but
> if you check coder code it is already here at the higher level so it is
not
> a big deal for now.
>
> In terms of implementation it uses a LengthAwareCoder which delegates to
> another coder the encoding and just adds the byte count before the actual
> serialization. Not perfect but should be more than enough in terms of
> support and perf for beam if you think real pipelines (we try to avoid
> serializations or it is done on some well known points where this algo
> should be enough...worse case it is not a huge overhead, mainly just some
> memory overhead).
>
>
> The PR is available at https://github.com/apache/beam/pull/4594. If you
> check you will see I put it "WIP". The main reason is that it changes the
> encoding format for containers (lists, iterable, ...) and therefore breaks
> python/go/... tests and the standard_coders.yml definition. Some help on
> that would be very welcomed.
>
> Technical side note if you wonder: UnownedInputStream doesn't even allow
to
> mark the stream so there is no real fast way to read the stream as fast as
> possible with standard buffering strategies and to support this automatic
> IterableCoder wrapping which is implicit. In other words, if beam wants to
> support any coder, including the ones not requiring to write the size of
the
> output - most of the codecs - then we need to change the way it works to
> something like that which does it for the user which doesn't know its
coder
> got wrapped.
>
> Hope it makes sense, if not, don't hesitate to ask questions.
>
> Happy end of week-end.
>
> Romain Manni-Bucau
> @rmannibucau |  Blog | Old Blog | Github | LinkedIn | Book

Re: coder evolutions?

Posted by Robert Bradshaw <ro...@google.com>.
On Sun, Feb 4, 2018 at 6:44 AM, Romain Manni-Bucau
<rm...@gmail.com> wrote:
> Hi guys,
>
> I submitted a PR on coders to enhance 1. the user experience 2. the
> determinism and handling of coders.
>
> 1. the user experience is linked to what i sent some days ago: close
> handling of the streams from a coder code. Long story short I add a
> SkipCloseCoder which can decorate a coder and just wraps the stream (input
> or output) in flavors skipping close() calls. This avoids to do it by
> default (which had my preference if you read the related thread but not the
> one of everybody) but also makes the usage of a coder with this issue easy
> since the of() of the coder just wraps itself in this delagating coder.
>
> 2. this one is more nasty and mainly concerns IterableLikeCoders. These ones
> use this kind of algorithm (keep in mind they work on a list):
>
> writeSize()
> for all element e {
>     elementCoder.write(e)
> }
> writeMagicNumber() // this one depends the size
>
> The decoding is symmetric so I bypass it here.
>
> Indeed all these writes (reads) are done on the same stream. Therefore it
> assumes you read as much bytes than you write...which is a huge assumption
> for a coder which should by contract assume it can read the stream...as a
> stream (until -1).
>
> The idea of the fix is to change this encoding to this kind of algorithm:
>
> writeSize()
> for all element e {
>     writeElementByteCount(e)
>     elementCoder.write(e)
> }
> writeMagicNumber() // still optionally

Regardless of the backwards incompatibility issues, I'm unconvinced
that prefixing every element with its length is a good idea. It can
lead to blow-up in size (e.g. a list of ints, and it should be noted
that containers with lots of elements bias towards having small
elements), and writeElementByteCount(e) could also be very inefficient for
many types e (e.g. a list of lists), since knowing the byte count up front
generally means buffering the element's encoding or encoding it twice.
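
(For a rough sense of the size blow-up, assuming a 1-byte var-int length
prefix: a list of a million small ints that each encode to a single byte
would grow from roughly 1 MB to roughly 2 MB of encoded data, i.e. about
100% overhead.)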

> This way on the decode size you can wrap the stream by element to enforce
> the limitation of the byte count.
>
> Side note: this indeed enforce a limitation due to java byte limitation but
> if you check coder code it is already here at the higher level so it is not
> a big deal for now.
>
> In terms of implementation it uses a LengthAwareCoder which delegates to
> another coder the encoding and just adds the byte count before the actual
> serialization. Not perfect but should be more than enough in terms of
> support and perf for beam if you think real pipelines (we try to avoid
> serializations or it is done on some well known points where this algo
> should be enough...worse case it is not a huge overhead, mainly just some
> memory overhead).
>
>
> The PR is available at https://github.com/apache/beam/pull/4594. If you
> check you will see I put it "WIP". The main reason is that it changes the
> encoding format for containers (lists, iterable, ...) and therefore breaks
> python/go/... tests and the standard_coders.yml definition. Some help on
> that would be very welcomed.
>
> Technical side note if you wonder: UnownedInputStream doesn't even allow to
> mark the stream so there is no real fast way to read the stream as fast as
> possible with standard buffering strategies and to support this automatic
> IterableCoder wrapping which is implicit. In other words, if beam wants to
> support any coder, including the ones not requiring to write the size of the
> output - most of the codecs - then we need to change the way it works to
> something like that which does it for the user which doesn't know its coder
> got wrapped.
>
> Hope it makes sense, if not, don't hesitate to ask questions.
>
> Happy end of week-end.
>
> Romain Manni-Bucau
> @rmannibucau |  Blog | Old Blog | Github | LinkedIn | Book