Posted to dev@beam.apache.org by Reuven Lax <re...@google.com> on 2018/06/03 19:44:46 UTC

Re: Beam SQL Improvements

Just an update: Romain and I chatted on Slack, and I think I understand his
concern. The concern wasn't specifically about schemas, but rather about having
a generic way to register per-ParDo state that has worker lifetime. As
evidence that this is needed, static variables are used in many cases to
simulate it. Static variables have downsides, however: if two pipelines
are run in the same JVM (this happens often with unit tests, and nothing
prevents a runner from doing so in a production environment), these
static variables will interfere with each other.
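
The interference can be sketched in plain Java, with no Beam dependency.
StaticCacheFn and ScopedCacheFn are hypothetical stand-ins for a DoFn and are
not Beam classes; the sketch only assumes that two "pipelines" create their own
function instances inside one JVM.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a DoFn that keeps "worker lifetime" state in a static field.
class StaticCacheFn {
    // Shared by every instance in the JVM, regardless of which pipeline created it.
    static final List<String> SEEN = new ArrayList<>();

    void process(String element) {
        SEEN.add(element);
    }
}

// Same logic, but the state belongs to the instance that owns it.
class ScopedCacheFn {
    private final List<String> seen = new ArrayList<>();

    void process(String element) {
        seen.add(element);
    }

    List<String> seen() {
        return seen;
    }
}

class StaticStateDemo {
    public static void main(String[] args) {
        // Two "pipelines" in the same JVM, each with its own function instance.
        StaticCacheFn pipelineA = new StaticCacheFn();
        StaticCacheFn pipelineB = new StaticCacheFn();
        pipelineA.process("a1");
        pipelineB.process("b1");
        // Both pipelines wrote into the same static list.
        System.out.println(StaticCacheFn.SEEN);

        ScopedCacheFn scopedA = new ScopedCacheFn();
        ScopedCacheFn scopedB = new ScopedCacheFn();
        scopedA.process("a1");
        scopedB.process("b1");
        // Each instance only holds its own elements.
        System.out.println(scopedA.seen());
        System.out.println(scopedB.seen());
    }
}
```

Instance-scoped state avoids the collision but does not by itself give worker
lifetime, which is the gap the discussion was about.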

On Thu, May 24, 2018 at 12:30 AM Reuven Lax <re...@google.com> wrote:

> Romain, maybe it would be useful for us to find some time on Slack. I'd
> like to understand your concerns. Also keep in mind that I'm tagging all
> these classes as Experimental for now, so we can definitely change these
> interfaces around if we decide they are not the best ones.
>
> Reuven
>
> On Tue, May 22, 2018 at 11:35 PM Romain Manni-Bucau <rm...@gmail.com>
> wrote:
>
>> Why not extend ProcessContext to add the new remapped output? But it
>> looks good (the part I don't like is that creating a new context each time a
>> new feature is added hurts users. What happens when Beam adds some
>> reactive support? ReactiveOutputReceiver?)
>>
>> Pipeline sounds like the wrong storage: once distributed, the instances are
>> serialized, which kind of breaks the lifecycle of the original instance, and
>> there is no real release/close hook on them anymore, right? Not sure we can
>> do better than DoFn/source embedded instances today.
>>
>>
>>
>>
>> On Wed, May 23, 2018 at 08:02, Romain Manni-Bucau <rm...@gmail.com>
>> wrote:
>>
>>>
>>>
>>> On Wed, May 23, 2018 at 07:55, Jean-Baptiste Onofré <jb...@nanthrax.net>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> IMHO, it would be better to have an explicit transform/IO as converter.
>>>>
>>>> It would be easier for users.
>>>>
>>>> Another option would be to use a "TypeConverter/SchemaConverter" map as
>>>> we do in Camel: Beam could check the source/destination "type" and check
>>>> in the map if there's a converter available. This map can be stored as
>>>> part of the pipeline (as we do for filesystem registration).
>>>>
>>>
>>>
>>> It works in Camel because it is not strongly typed, doesn't it? So it could
>>> require a new Beam pipeline API.
>>>
>>> +1 for the explicit transform; if added to the pipeline API like a coder, it
>>> wouldn't break the fluent API:
>>>
>>> p.apply(io).setOutputType(Foo.class)
>>>
>>> Coders can be a workaround since they own the type, but since the
>>> PCollection is the real owner it is surely saner this way, no?
>>>
>>> It also needs to ensure all converters are present before running the
>>> pipeline; having no implicit environment converter support is probably
>>> a good way to start, to avoid late surprises.
>>>
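
A rough sketch of the converter map idea, together with the up-front check
that a converter exists before anything runs. ConverterRegistry and its methods
are hypothetical names chosen for illustration; this is neither Camel's nor
Beam's API, and lookup is by exact class only.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical registry keyed by (source type, destination type).
class ConverterRegistry {
    private final Map<String, Function<Object, Object>> converters = new HashMap<>();

    private static String key(Class<?> from, Class<?> to) {
        return from.getName() + "->" + to.getName();
    }

    // Registers a typed converter; the cast keeps the stored function type-safe.
    <S, T> void register(Class<S> from, Class<T> to, Function<S, T> fn) {
        converters.put(key(from, to), value -> fn.apply(from.cast(value)));
    }

    // Lets a caller verify, before running, that a needed conversion is available.
    boolean canConvert(Class<?> from, Class<?> to) {
        return converters.containsKey(key(from, to));
    }

    // Converts using the converter registered for the value's exact class.
    <T> T convert(Object value, Class<T> to) {
        Function<Object, Object> fn = converters.get(key(value.getClass(), to));
        if (fn == null) {
            throw new IllegalStateException(
                "No converter from " + value.getClass().getName() + " to " + to.getName());
        }
        return to.cast(fn.apply(value));
    }
}

class ConverterRegistryDemo {
    public static void main(String[] args) {
        ConverterRegistry registry = new ConverterRegistry();
        registry.register(String.class, Integer.class, Integer::valueOf);

        // Fail fast at construction time instead of at the first record.
        if (!registry.canConvert(String.class, Integer.class)) {
            throw new IllegalStateException("missing converter");
        }
        System.out.println(registry.convert("42", Integer.class));
    }
}
```

Checking canConvert while the pipeline is being built is one way to get the
"no late surprises" behaviour asked for above.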
>>>
>>>
>>>> My $0.01
>>>>
>>>> Regards
>>>> JB
>>>>
>>>> On 23/05/2018 07:51, Romain Manni-Bucau wrote:
>>>> > How does it work on the pipeline side?
>>>> > Do you generate these "virtual" IOs at build time so that the fluent
>>>> > API works without erasing generics?
>>>> >
>>>> > ex: SQL(row)->BigQuery(native) will not compile so we need a
>>>> > SQL(row)->BigQuery(row)
>>>> >
>>>> > Side note unrelated to Row: if you add another registry, maybe a
>>>> > prerequisite is to ensure Beam has a kind of singleton/context, to avoid
>>>> > duplicating it or not tracking it properly. These kinds of converters
>>>> > will need a global close, and not only a per-record one, in general:
>>>> > converter.init();converter.convert(row);....converter.destroy();,
>>>> > otherwise it easily leaks. This is why it can require some way to not
>>>> > recreate it. A quick fix, if you are in ByteBuddy already, can probably
>>>> > be to add it to setup/teardown; being more global would be nicer but is
>>>> > more challenging.
>>>> >
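
The init/convert/destroy lifecycle described above could look roughly like
this. LifecycleConverter, UpperCaseConverter and ConverterLifecycle are
hypothetical names, not Beam interfaces; in a DoFn the init and destroy calls
would presumably move into the @Setup and @Teardown methods.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical converter with an explicit lifecycle.
interface LifecycleConverter<S, T> {
    void init();
    T convert(S input);
    void destroy();
}

// Toy implementation that records lifecycle events so the call order is visible.
class UpperCaseConverter implements LifecycleConverter<String, String> {
    final List<String> events = new ArrayList<>();

    @Override
    public void init() {
        events.add("init");
    }

    @Override
    public String convert(String input) {
        events.add("convert");
        return input.toUpperCase(Locale.ROOT);
    }

    @Override
    public void destroy() {
        events.add("destroy");
    }
}

class ConverterLifecycle {
    // Calls init once, converts every record, and always calls destroy,
    // even if a conversion throws, so the converter cannot leak.
    static <S, T> List<T> convertAll(LifecycleConverter<S, T> converter, List<S> records) {
        converter.init();
        try {
            List<T> out = new ArrayList<>();
            for (S record : records) {
                out.add(converter.convert(record));
            }
            return out;
        } finally {
            converter.destroy();
        }
    }
}
```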
>>>> > Romain Manni-Bucau
>>>> > @rmannibucau <https://twitter.com/rmannibucau> |  Blog
>>>> > <https://rmannibucau.metawerx.net/> | Old Blog
>>>> > <http://rmannibucau.wordpress.com> | Github
>>>> > <https://github.com/rmannibucau> | LinkedIn
>>>> > <https://www.linkedin.com/in/rmannibucau> | Book
>>>> > <https://www.packtpub.com/application-development/java-ee-8-high-performance>
>>>> >
>>>> >
>>>> > On Wed, May 23, 2018 at 07:22, Reuven Lax <relax@google.com
>>>> > <ma...@google.com>> wrote:
>>>> >
>>>> >     No - the only modules we need to add to core are the ones we choose
>>>> >     to add. For example, I will probably add a registration for
>>>> >     TableRow/TableSchema (GCP BigQuery) so these can work seamlessly
>>>> >     with schemas. However I will add that to the GCP module, so only
>>>> >     someone depending on that module needs to pull in that dependency.
>>>> >     The Java ServiceLoader framework can be used by these modules to
>>>> >     register schemas for their types (we already do something similar
>>>> >     for FileSystem and for coders as well).
>>>> >
>>>> >     BTW, right now I'm doing the conversion back and forth between Row
>>>> >     objects in the ByteBuddy-generated bytecode that we generate in
>>>> >     order to invoke DoFns.
>>>> >
>>>> >     Reuven
>>>> >
>>>> >     On Tue, May 22, 2018 at 10:04 PM Romain Manni-Bucau
>>>> >     <rmannibucau@gmail.com <ma...@gmail.com>> wrote:
>>>> >
>>>> >         Hmm, the pluggability part is close to what I wanted to do with
>>>> >         JsonObject as a main API (to avoid redoing a "row" API and
>>>> >         schema API).
>>>> >         Row.as(Class<T>) sounds good, but then does it mean we'll get
>>>> >         beam-sdk-java-row-jsonobject-like modules (I'm not against, just
>>>> >         trying to understand here)?
>>>> >         If so, how can an IO use as() with the type it expects? Doesn't
>>>> >         it lead to having tons of these modules in the end?
>>>> >
>>>> >
>>>> >
>>>> >         On Wed, May 23, 2018 at 04:57, Reuven Lax <relax@google.com
>>>> >         <ma...@google.com>> wrote:
>>>> >
>>>> >             By the way Romain, if you have specific scenarios in mind I
>>>> >             would love to hear them. I can try and guess what exactly
>>>> >             you would like to get out of schemas, but it would work
>>>> >             better if you gave me concrete scenarios that you would like
>>>> >             to work.
>>>> >
>>>> >             Reuven
>>>> >
>>>> >             On Tue, May 22, 2018 at 7:45 PM Reuven Lax <relax@google.com
>>>> >             <ma...@google.com>> wrote:
>>>> >
>>>> >                 Yeah, what I'm working on will help with IO. Basically
>>>> >                 if you register a function with SchemaRegistry that
>>>> >                 converts back and forth between a type (say JsonObject)
>>>> >                 and a Beam Row, then it is applied by the framework
>>>> >                 behind the scenes as part of DoFn invocation. Concrete
>>>> >                 example: let's say I have an IO that reads JSON objects
>>>> >
>>>> >                   class MyJsonIORead extends PTransform<PBegin,
>>>> >                       PCollection<JsonObject>> {...}
>>>> >
>>>> >                 If you register a schema for this type (or you can also
>>>> >                 just set the schema directly on the output PCollection),
>>>> >                 then Beam knows how to convert back and forth between
>>>> >                 JsonObject and Row. So the next ParDo can look like
>>>> >
>>>> >                 p.apply(new MyJsonIORead())
>>>> >                 .apply(ParDo.of(new DoFn<JsonObject, T>....
>>>> >                     @ProcessElement void process(@Element Row row) {
>>>> >                    })
>>>> >
>>>> >                 And Beam will automatically convert JsonObject to a Row
>>>> >                 for processing (you aren't forced to do this of course -
>>>> >                 you can always ask for it as a JsonObject).
>>>> >
>>>> >                 The same is true for output. If you have a sink that
>>>> >                 takes in JsonObject but the transform before it produces
>>>> >                 Row objects (for instance - because the transform before
>>>> >                 it is Beam SQL), Beam can automatically convert Row back
>>>> >                 to JsonObject for you.
>>>> >
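
To illustrate the registration idea without depending on Beam, here is a
deliberately simplified sketch in which a "row" is modeled as a plain Map.
SimpleSchemaRegistry and Person are hypothetical; Beam's actual SchemaRegistry
and Row classes are richer and their signatures are not reproduced here.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical registry holding a to-row and a from-row function per user type.
class SimpleSchemaRegistry {
    private final Map<Class<?>, Function<Object, Map<String, Object>>> toRowFns = new HashMap<>();
    private final Map<Class<?>, Function<Map<String, Object>, Object>> fromRowFns = new HashMap<>();

    <T> void register(Class<T> type,
                      Function<T, Map<String, Object>> to,
                      Function<Map<String, Object>, T> from) {
        toRowFns.put(type, value -> to.apply(type.cast(value)));
        fromRowFns.put(type, row -> from.apply(row));
    }

    // What a framework could call before handing an element to row-based code.
    Map<String, Object> toRow(Object value) {
        return lookup(toRowFns, value.getClass()).apply(value);
    }

    // What a framework could call before handing a row to a typed sink.
    <T> T fromRow(Map<String, Object> row, Class<T> type) {
        return type.cast(lookup(fromRowFns, type).apply(row));
    }

    private static <F> F lookup(Map<Class<?>, F> table, Class<?> type) {
        F fn = table.get(type);
        if (fn == null) {
            throw new IllegalStateException("No schema registered for " + type.getName());
        }
        return fn;
    }
}

// Hypothetical user type with hand-written conversions in both directions.
class Person {
    final String name;
    final int age;

    Person(String name, int age) {
        this.name = name;
        this.age = age;
    }

    Map<String, Object> toRow() {
        Map<String, Object> row = new LinkedHashMap<>();
        row.put("name", name);
        row.put("age", age);
        return row;
    }

    static Person fromRow(Map<String, Object> row) {
        return new Person((String) row.get("name"), (Integer) row.get("age"));
    }
}
```

A registration would then read
registry.register(Person.class, Person::toRow, Person::fromRow), after which
user code never calls the conversion functions directly.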
>>>> >                 All of this was detailed in the Schema doc I shared a
>>>> >                 few months ago. There was a lot of discussion on that
>>>> >                 document from various parties, and some of this API is a
>>>> >                 result of that discussion. This is also working in the
>>>> >                 branch JB and I were working on, though not yet
>>>> >                 integrated back to master.
>>>> >
>>>> >                 I would like to actually go further and make Row an
>>>> >                 interface and provide a way to automatically put a Row
>>>> >                 interface on top of any other object (e.g. JsonObject,
>>>> >                 Pojo, etc.). This won't change the way the user writes
>>>> >                 code, but instead of Beam having to copy and convert at
>>>> >                 each stage (e.g. from JsonObject to Row) it will simply
>>>> >                 create a Row object that uses the JsonObject as its
>>>> >                 underlying storage.
>>>> >
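
The "Row as an interface over existing storage" idea can be sketched as a
read-only view. RowView and MapBackedRow are hypothetical names and a Map
stands in for the JsonObject; the point is only that the wrapper keeps a
reference to the original object instead of copying its fields.

```java
import java.util.Map;

// Hypothetical minimal row interface; Beam's actual Row type is not reproduced here.
interface RowView {
    Object getValue(String fieldName);
}

// Wraps an existing map instead of copying its entries into a new row object.
class MapBackedRow implements RowView {
    private final Map<String, Object> storage;

    MapBackedRow(Map<String, Object> storage) {
        this.storage = storage; // keep a reference, no copy
    }

    @Override
    public Object getValue(String fieldName) {
        if (!storage.containsKey(fieldName)) {
            throw new IllegalArgumentException("Unknown field: " + fieldName);
        }
        return storage.get(fieldName);
    }
}
```

Because the view reads through to the wrapped object, a later change to the
underlying storage is visible through the row, which is the trade-off of
skipping the copy.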
>>>> >                 Reuven
>>>> >
>>>> >                 On Tue, May 22, 2018 at 11:37 AM Romain Manni-Bucau
>>>> >                 <rmannibucau@gmail.com <mailto:rmannibucau@gmail.com>>
>>>> >                 wrote:
>>>> >
>>>> >                     Well, Beam can implement a new mapper, but it doesn't
>>>> >                     help for IO. Most modern backends will take JSON
>>>> >                     directly, even the javax one, and it must stay generic.
>>>> >
>>>> >                     Then, since JSON-to-POJO mapping has already been done
>>>> >                     a dozen times, I'm not sure it is worth it for now.
>>>> >
>>>> >                     On Tue, May 22, 2018 at 20:27, Reuven Lax
>>>> >                     <relax@google.com <ma...@google.com>> wrote:
>>>> >
>>>> >                         We can do even better btw. Building a
>>>> >                         SchemaRegistry where automatic conversions can
>>>> >                         be registered between schema and Java data
>>>> >                         types. With this the user won't even need a DoFn
>>>> >                         to do the conversion.
>>>> >
>>>> >                         On Tue, May 22, 2018, 10:13 AM Romain
>>>> >                         Manni-Bucau <rmannibucau@gmail.com
>>>> >                         <ma...@gmail.com>> wrote:
>>>> >
>>>> >                             Hi guys,
>>>> >
>>>> >                             I checked out what has been done on the schema
>>>> >                             model and think it is acceptable - regarding
>>>> >                             the JSON debate - if
>>>> >                             https://issues.apache.org/jira/browse/BEAM-4381
>>>> >                             can be fixed.
>>>> >
>>>> >                             At a high level, it is about providing a
>>>> >                             mainstream and not too impacting model OOTB,
>>>> >                             and JSON seems the most valid option for
>>>> >                             now, at least for IO and some user transforms.
>>>> >
>>>> >                             Wdyt?
>>>> >
>>>> >                             On Fri, Apr 27, 2018 at 18:36, Romain
>>>> >                             Manni-Bucau <rmannibucau@gmail.com
>>>> >                             <ma...@gmail.com>> wrote:
>>>> >
>>>> >                                 Can give it a try at the end of May, sure
>>>> >                                 (holidays and work constraints will make
>>>> >                                 it hard before).
>>>> >
>>>> >                                 On Apr 27, 2018 at 18:26, "Anton Kedin"
>>>> >                                 <kedin@google.com
>>>> >                                 <ma...@google.com>> wrote:
>>>> >
>>>> >                                     Romain,
>>>> >
>>>> >                                     I don't believe that the JSON approach
>>>> >                                     was investigated very thoroughly. I
>>>> >                                     mentioned a few reasons which will
>>>> >                                     make it not the best choice in my
>>>> >                                     opinion, but I may be wrong. Can you
>>>> >                                     put together a design doc or a
>>>> >                                     prototype?
>>>> >
>>>> >                                     Thank you,
>>>> >                                     Anton
>>>> >
>>>> >
>>>> >                                     On Thu, Apr 26, 2018 at 10:17 PM
>>>> >                                     Romain Manni-Bucau
>>>> >                                     <rmannibucau@gmail.com
>>>> >                                     <ma...@gmail.com>> wrote:
>>>> >
>>>> >
>>>> >
>>>> >                                         On Apr 26, 2018 at 23:13, "Anton
>>>> >                                         Kedin" <kedin@google.com
>>>> >                                         <ma...@google.com>> wrote:
>>>> >
>>>> >                                             BeamRecord (Row) has very
>>>> >                                             little in common with
>>>> >                                             JsonObject (I assume you're
>>>> >                                             talking about javax.json),
>>>> >                                             except maybe some
>>>> >                                             similarities of the API. A few
>>>> >                                             reasons why JsonObject
>>>> >                                             doesn't work:
>>>> >
>>>> >                                               * it is a Java EE API:
>>>> >                                                   o Beam SDK is not
>>>> >                                                     limited to Java.
>>>> >                                                     There are probably
>>>> >                                                     similar APIs for
>>>> >                                                     other languages but
>>>> >                                                     they might not
>>>> >                                                     necessarily carry
>>>> >                                                     the same semantics /
>>>> >                                                     APIs;
>>>> >
>>>> >
>>>> >                                         Not a big deal I think. At least
>>>> >                                         not a technical blocker.
>>>> >
>>>> >                                                   o It can change
>>>> >                                                     between Java versions;
>>>> >
>>>> >                                         No, this is Java EE ;).
>>>> >
>>>> >
>>>> >                                                   o Current Beam Java
>>>> >                                                     implementation is an
>>>> >                                                     experimental feature
>>>> >                                                     to identify what's
>>>> >                                                     needed from such an
>>>> >                                                     API; in the end we
>>>> >                                                     might end up with
>>>> >                                                     something similar to
>>>> >                                                     the JsonObject API,
>>>> >                                                     but likely not;
>>>> >
>>>> >
>>>> >                                         I don't get that point as a
>>>> >                                         blocker.
>>>> >
>>>> >                                               * represents JSON, which
>>>> >                                                 is not an API but an
>>>> >                                                 object notation:
>>>> >                                                   o it is defined as a
>>>> >                                                     Unicode string in a
>>>> >                                                     certain format. If
>>>> >                                                     you choose to adhere
>>>> >                                                     to ECMA-404, then it
>>>> >                                                     doesn't sound like
>>>> >                                                     JsonObject can
>>>> >                                                     represent an Avro
>>>> >                                                     object, if I'm
>>>> >                                                     reading it right;
>>>> >
>>>> >
>>>> >                                         That is in the generator
>>>> >                                         implementation; you can implement
>>>> >                                         an Avro generator.
>>>> >
>>>> >                                               * doesn't define a type
>>>> >                                                 system (JSON does, but
>>>> >                                                 it's lacking):
>>>> >                                                   o for example, JSON
>>>> >                                                     doesn't define
>>>> >                                                     semantics for numbers;
>>>> >                                                   o doesn't define
>>>> >                                                     date/time types;
>>>> >                                                   o doesn't allow
>>>> >                                                     extending JSON type
>>>> >                                                     system at all;
>>>> >
>>>> >
>>>> >                                         That is why you need a metadata
>>>> >                                         object, or simpler, a schema
>>>> >                                         with that data. JSON or Beam
>>>> >                                         record doesn't help here and you
>>>> >                                         end up with the same outcome if
>>>> >                                         you think about it.
>>>> >
>>>> >                                               * lacks schemas;
>>>> >
>>>> >                                         JSON Schema is standard,
>>>> >                                         widespread and well tooled
>>>> >                                         compared to the alternatives.
>>>> >
>>>> >                                             You can definitely try to
>>>> >                                             loosen the requirements and
>>>> >                                             define everything in JSON in
>>>> >                                             userland, but the point of
>>>> >                                             Row/Schema is to avoid it
>>>> >                                             and define everything in the
>>>> >                                             Beam model, which can be
>>>> >                                             extended, mapped to JSON,
>>>> >                                             Avro, BigQuery Schemas,
>>>> >                                             custom binary formats etc.,
>>>> >                                             with the same semantics
>>>> >                                             across Beam SDKs.
>>>> >
>>>> >
>>>> >                                         This is what JSON-P would allow,
>>>> >                                         with the benefit of natural
>>>> >                                         POJO support through JSON-B.
>>>> >
>>>> >
>>>> >
>>>> >                                             On Thu, Apr 26, 2018 at
>>>> >                                             12:28 PM Romain Manni-Bucau
>>>> >                                             <rmannibucau@gmail.com
>>>> >                                             <mailto:rmannibucau@gmail.com>>
>>>> >                                             wrote:
>>>> >
>>>> >                                                 Just to be clear and to
>>>> >                                                 let me understand: how is
>>>> >                                                 BeamRecord different from
>>>> >                                                 a JsonObject, which is an
>>>> >                                                 API without
>>>> >                                                 implementation (not even
>>>> >                                                 a JSON one OOTB)? The
>>>> >                                                 advantages of the JSON
>>>> >                                                 *API* are indeed natural
>>>> >                                                 mapping (JSON-B is based
>>>> >                                                 on JSON-P, so no new
>>>> >                                                 binding to reinvent) and
>>>> >                                                 simple serialization
>>>> >                                                 (JSON+gzip for example, or
>>>> >                                                 Avro if you want to be
>>>> >                                                 geeky).
>>>> >
>>>> >                                                 I fail to see the point
>>>> >                                                 of rebuilding an
>>>> >                                                 ecosystem ATM.
>>>> >
>>>> >                                                 On Apr 26, 2018 at 19:12,
>>>> >                                                 "Reuven Lax"
>>>> >                                                 <relax@google.com
>>>> >                                                 <mailto:relax@google.com>>
>>>> >                                                 wrote:
>>>> >
>>>> >                                                     Exactly what JB
>>>> >                                                     said. We will write
>>>> >                                                     a generic conversion
>>>> >                                                     from Avro (or JSON)
>>>> >                                                     to Beam schemas,
>>>> >                                                     which will make them
>>>> >                                                     work transparently
>>>> >                                                     with SQL. The plan
>>>> >                                                     is also to migrate
>>>> >                                                     Anton's work so that
>>>> >                                                     POJOs work
>>>> >                                                     generically for any
>>>> >                                                     schema.
>>>> >
>>>> >                                                     Reuven
>>>> >
>>>> >                                                     On Thu, Apr 26, 2018
>>>> >                                                     at 1:17 AM
>>>> >                                                     Jean-Baptiste Onofré
>>>> >                                                     <jb@nanthrax.net
>>>> >                                                     <mailto:jb@nanthrax.net>>
>>>> >                                                     wrote:
>>>> >
>>>> >                                                         For now we have
>>>> >                                                         a generic schema
>>>> >                                                         interface.
>>>> >                                                         JSON-B can be an
>>>> >                                                         impl, Avro could
>>>> >                                                         be another one.
>>>> >
>>>> >                                                         Regards
>>>> >                                                         JB
>>>> >                                                         On Apr 26, 2018,
>>>> >                                                         at 12:08, Romain
>>>> >                                                         Manni-Bucau
>>>> >                                                         <rmannibucau@gmail.com
>>>> >                                                         <mailto:rmannibucau@gmail.com>>
>>>> >                                                         wrote:
>>>> >
>>>> >                                                             Hmm,
>>>> >
>>>> >                                                             Avro still has the pitfall of an
>>>> >                                                             uncontrolled stack which brings way
>>>> >                                                             too many dependencies to be part of
>>>> >                                                             any API; this is why I proposed a
>>>> >                                                             JSON-P based API (JsonObject) with a
>>>> >                                                             custom Beam entry for some metadata
>>>> >                                                             (headers "à la Camel").
>>>> >
>>>> >
>>>> >
>>>> >
>>>> >                                                             2018-04-26 9:59 GMT+02:00
>>>> >                                                             Jean-Baptiste Onofré <jb@nanthrax.net
>>>> >                                                             <mailto:jb@nanthrax.net>>:
>>>> >
>>>> >
>>>> >                                                                 Hi Ismael
>>>> >
>>>> >                                                                 You mean directly in Beam SQL?
>>>> >
>>>> >                                                                 That will be part of schema support:
>>>> >                                                                 generic record could be one of the
>>>> >                                                                 payload with across schema.
>>>> >
>>>> >                                                                 Regards
>>>> >                                                                 JB
>>>> >                                                                 On Apr 26, 2018 at 11:39,
>>>> >                                                                 "Ismaël Mejía" <iemejia@gmail.com
>>>> >                                                                 <ma...@gmail.com>> wrote:
>>>> >
>>>> >
>>>> >                                                                     Hello Anton,
>>>> >
>>>> >                                                                     Thanks for the descriptive email and the really
>>>> >                                                                     useful work. Any plans to tackle PCollections of
>>>> >                                                                     GenericRecord/IndexedRecords? It seems Avro is a
>>>> >                                                                     natural fit for this approach too.
>>>> >
>>>> >                                                                     Regards,
>>>> >                                                                     Ismaël
>>>> >
>>>> >                                                                     On Wed, Apr 25, 2018 at 9:04 PM, Anton Kedin
>>>> >                                                                     <kedin@google.com <ma...@google.com>> wrote:
>>>> >
>>>> >                                                                         Hi,
>>>> >
>>>> >                                                                         I want to highlight a couple of improvements to
>>>> >                                                                         Beam SQL we have been working on recently which
>>>> >                                                                         are targeted to make Beam SQL API easier to use.
>>>> >                                                                         Specifically these features simplify conversion
>>>> >                                                                         of Java Beans and JSON strings to Rows.
>>>> >
>>>> >                                                                         Feel free to try this and send any
>>>> >                                                                         bugs/comments/PRs my way.
>>>> >
>>>> >                                                                         **Caveat: this is still work in progress, and
>>>> >                                                                         has known bugs and incomplete features, see
>>>> >                                                                         below for details.**
>>>> >
>>>> >                                                                         Background
>>>> >
>>>> >                                                                         Beam SQL queries can only be applied to
>>>> >                                                                         PCollection<Row>. This means that users need to
>>>> >                                                                         convert whatever PCollection elements they have
>>>> >                                                                         to Rows before querying them with SQL. This
>>>> >                                                                         usually requires
>>>>
>>>
>>>

Re: Beam SQL Improvements

Posted by Romain Manni-Bucau <rm...@gmail.com>.
This can create other issues with IO if the runner is not designed for it
(like the direct runner), so it is probably not something the generic part
of Beam can rely on :(.

On Mon, Jun 4, 2018 at 20:10, Lukasz Cwik <lc...@google.com> wrote:

> Shouldn't the runner isolate each instance of the pipeline behind an
> appropriate class loader?
>
> On Sun, Jun 3, 2018 at 12:45 PM Reuven Lax <re...@google.com> wrote:
>
>> Just an update: Romain and I chatted on Slack, and I think I understand
>> his concern. The concern wasn't specifically about schemas, rather about
>> having a generic way to register per-ParDo state that has worker lifetime.
>> As evidence that such is needed, in many cases static variables are used to
>> simulate that. Static variables, however, have downsides - if two pipelines
>> are run on the same JVM (happens often with unit tests, and there's nothing
>> that prevents a runner from doing so in a production environment), these
>> static variables will interfere with each other.
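The interference described in the quoted paragraph above can be reproduced without Beam at all. The following is only a minimal sketch in plain Java: StaticCounterFn and ScopedCounterFn are hypothetical stand-ins for DoFns, and the explicit "pipeline id" key is an assumption made for illustration, not an existing Beam API.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

class StaticStateDemo {
    public static void main(String[] args) {
        // Two "pipelines" in one JVM: the static counter is shared between them.
        int first = new StaticCounterFn().process("a");
        int second = new StaticCounterFn().process("b");
        System.out.println("static: " + first + ", " + second);

        // With state scoped by pipeline id, each pipeline keeps its own counter.
        int p1 = new ScopedCounterFn("pipeline-1").process("a");
        int p2 = new ScopedCounterFn("pipeline-2").process("b");
        System.out.println("scoped: " + p1 + ", " + p2);
    }
}

// Hypothetical stand-in for a DoFn that simulates worker-lifetime state with a static field.
class StaticCounterFn {
    // One counter per class loader, shared by every pipeline that uses this class.
    static final AtomicInteger SEEN = new AtomicInteger();

    int process(String element) {
        return SEEN.incrementAndGet();
    }
}

// Hypothetical alternative: the same state, looked up by an explicit pipeline id.
class ScopedCounterFn {
    private static final Map<String, AtomicInteger> SEEN_BY_PIPELINE = new ConcurrentHashMap<>();
    private final String pipelineId;

    ScopedCounterFn(String pipelineId) {
        this.pipelineId = pipelineId;
    }

    int process(String element) {
        return SEEN_BY_PIPELINE
            .computeIfAbsent(pipelineId, id -> new AtomicInteger())
            .incrementAndGet();
    }
}
```

The second variant still uses a static map, so it only shows the scoping idea; it does not provide the release/close hook discussed elsewhere in this thread.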
>>
>> On Thu, May 24, 2018 at 12:30 AM Reuven Lax <re...@google.com> wrote:
>>
>>> Romain, maybe it would be useful for us to find some time on slack. I'd
>>> like to understand your concerns. Also keep in mind that I'm tagging all
>>> these classes as Experimental for now, so we can definitely change these
>>> interfaces around if we decide they are not the best ones.
>>>
>>> Reuven
>>>
>>> On Tue, May 22, 2018 at 11:35 PM Romain Manni-Bucau <
>>> rmannibucau@gmail.com> wrote:
>>>
>>>> Why not extend ProcessContext to add the new remapped output? But it
>>>> looks good (the part I don't like is that creating a new context each time a
>>>> new feature is added hurts users. What happens when Beam adds some
>>>> reactive support? ReactiveOutputReceiver?)
>>>>
>>>> Pipeline sounds like the wrong storage: once distributed, the instances
>>>> have been serialized, which more or less breaks the lifecycle of the
>>>> original instance, and there is no real release/close hook on them
>>>> anymore, right? Not sure we can do better than DoFn/source embedded
>>>> instances today.
>>>>
>>>>
>>>>
>>>>
>>>> On Wed, May 23, 2018 at 08:02, Romain Manni-Bucau <rm...@gmail.com>
>>>> wrote:
>>>>
>>>>>
>>>>>
>>>>> On Wed, May 23, 2018 at 07:55, Jean-Baptiste Onofré <jb...@nanthrax.net>
>>>>> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> IMHO, it would be better to have an explicit transform/IO as converter.
>>>>>>
>>>>>> It would be easier for users.
>>>>>>
>>>>>> Another option would be to use a "TypeConverter/SchemaConverter" map as
>>>>>> we do in Camel: Beam could check the source/destination "type" and check
>>>>>> in the map if there's a converter available. This map can be stored as
>>>>>> part of the pipeline (as we do for filesystem registration).
>>>>>>
>>>>>
>>>>>
>>>>> It works in Camel because it is not strongly typed, doesn't it? So it
>>>>> may require a new Beam pipeline API.
>>>>>
>>>>> +1 for the explicit transform, if added to the pipeline api as coder
>>>>> it wouldn't break the fluent api:
>>>>>
>>>>> p.apply(io).setOutputType(Foo.class)
>>>>>
>>>>> Coders can be a workaround since they own the type, but since the
>>>>> PCollection is the real owner it is surely saner this way, no?
>>>>>
>>>>> Also, it probably needs to ensure all converters are present before
>>>>> running the pipeline; having no implicit environment converter support
>>>>> is probably a good start to avoid late surprises.
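The two ideas quoted above (a converter map keyed by source and destination type, plus a check that all converters exist before the pipeline runs) can be sketched in a few lines of plain Java. This is a hedged illustration only: ToyConverterRegistry and its methods are hypothetical names and are not part of Beam or Camel.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

class ConverterRegistryDemo {
    public static void main(String[] args) {
        ToyConverterRegistry registry = new ToyConverterRegistry();
        registry.register(String.class, Integer.class, Integer::valueOf);

        // Fail fast, before any element is processed, if a conversion is missing.
        registry.require(String.class, Integer.class);

        Integer parsed = registry.convert("42", Integer.class);
        System.out.println(parsed);
    }
}

// Hypothetical registry of converters, keyed by (source type, target type).
final class ToyConverterRegistry {
    private final Map<String, Function<Object, Object>> converters = new HashMap<>();

    private static String key(Class<?> from, Class<?> to) {
        return from.getName() + "->" + to.getName();
    }

    <S, T> void register(Class<S> from, Class<T> to, Function<S, T> fn) {
        // Store an untyped wrapper; the cast restores the source type on each call.
        converters.put(key(from, to), value -> fn.apply(from.cast(value)));
    }

    // Meant to be called at pipeline construction time for every needed conversion.
    void require(Class<?> from, Class<?> to) {
        if (!converters.containsKey(key(from, to))) {
            throw new IllegalStateException("No converter registered for " + key(from, to));
        }
    }

    <T> T convert(Object value, Class<T> to) {
        require(value.getClass(), to);
        return to.cast(converters.get(key(value.getClass(), to)).apply(value));
    }
}
```

Looking converters up by exact class keeps the sketch short; a real registry would also need to decide how subtypes and generic types are matched.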
>>>>>
>>>>>
>>>>>
>>>>>> My $0.01
>>>>>>
>>>>>> Regards
>>>>>> JB
>>>>>>
>>>>>> On 23/05/2018 07:51, Romain Manni-Bucau wrote:
>>>>>> > How does it work on the pipeline side?
>>>>>> > Do you generate these "virtual" IO at build time to enable the fluent
>>>>>> > API to work without erasing generics?
>>>>>> >
>>>>>> > ex: SQL(row)->BigQuery(native) will not compile so we need a
>>>>>> > SQL(row)->BigQuery(row)
>>>>>> >
>>>>>> > Side note unrelated to Row: if you add another registry, maybe a pre-task
>>>>>> > is to ensure Beam has a kind of singleton/context, to avoid duplicating
>>>>>> > it or not tracking it properly. These kinds of converters will need a
>>>>>> > global close and not only per record in general:
>>>>>> > converter.init();converter.convert(row);....converter.destroy();,
>>>>>> > otherwise it easily leaks. This is why it can require some way to not
>>>>>> > recreate it. A quick fix, if you are in ByteBuddy already, can probably be
>>>>>> > to add it to setup/teardown; being more global would be nicer but is more
>>>>>> > challenging.
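The init/convert/destroy ordering mentioned above, tied to setup and teardown instead of to each record, can be sketched as follows. This is an illustration under assumptions: LifecycleConverter, RecordingConverter and ConvertingFn are hypothetical names, and ConvertingFn only imitates the role that @Setup and @Teardown methods play on a Beam DoFn.

```java
import java.util.ArrayList;
import java.util.List;

class ConverterLifecycleDemo {
    public static void main(String[] args) {
        RecordingConverter converter = new RecordingConverter();
        ConvertingFn fn = new ConvertingFn(converter);
        fn.setup();      // once per instance, not per record
        fn.process("a");
        fn.process("b");
        fn.teardown();   // releases whatever init() acquired
        System.out.println(converter.events);
    }
}

// Hypothetical converter contract with an explicit lifecycle.
interface LifecycleConverter {
    void init();
    String convert(String row);
    void destroy();
}

// Records every lifecycle call so the call order can be inspected.
final class RecordingConverter implements LifecycleConverter {
    final List<String> events = new ArrayList<>();

    @Override public void init() { events.add("init"); }

    @Override public String convert(String row) {
        events.add("convert:" + row);
        return row.toUpperCase();
    }

    @Override public void destroy() { events.add("destroy"); }
}

// Stand-in for a DoFn that owns the converter for its whole lifetime.
final class ConvertingFn {
    private final LifecycleConverter converter;

    ConvertingFn(LifecycleConverter converter) { this.converter = converter; }

    void setup() { converter.init(); }

    String process(String row) { return converter.convert(row); }

    void teardown() { converter.destroy(); }
}
```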
>>>>>> >
>>>>>> > Romain Manni-Bucau
>>>>>> > @rmannibucau <https://twitter.com/rmannibucau> |  Blog
>>>>>> > <https://rmannibucau.metawerx.net/> | Old Blog
>>>>>> > <http://rmannibucau.wordpress.com> | Github
>>>>>> > <https://github.com/rmannibucau> | LinkedIn
>>>>>> > <https://www.linkedin.com/in/rmannibucau> | Book
>>>>>> > <
>>>>>> https://www.packtpub.com/application-development/java-ee-8-high-performance
>>>>>> >
>>>>>> >
>>>>>> >
>>>>>> > On Wed, May 23, 2018 at 07:22, Reuven Lax <relax@google.com
>>>>>> > <ma...@google.com>> wrote:
>>>>>> >
>>>>>> >     No - the only modules we need to add to core are the ones we choose
>>>>>> >     to add. For example, I will probably add a registration for
>>>>>> >     TableRow/TableSchema (GCP BigQuery) so these can work seamlessly
>>>>>> >     with schemas. However I will add that to the GCP module, so only
>>>>>> >     someone depending on that module needs to pull in that dependency.
>>>>>> >     The Java ServiceLoader framework can be used by these modules to
>>>>>> >     register schemas for their types (we already do something similar
>>>>>> >     for FileSystem and for coders as well).
>>>>>> >
>>>>>> >     BTW, right now I'm doing the conversion back and forth between Row
>>>>>> >     objects in the ByteBuddy-generated bytecode that we generate in order
>>>>>> >     to invoke DoFns.
>>>>>> >
>>>>>> >     Reuven
>>>>>> >
>>>>>> >     On Tue, May 22, 2018 at 10:04 PM Romain Manni-Bucau
>>>>>> >     <rmannibucau@gmail.com <ma...@gmail.com>> wrote:
>>>>>> >
>>>>>> >         Hmm, the pluggability part is close to what I wanted to do with
>>>>>> >         JsonObject as a main API (to avoid redoing a "row" API and
>>>>>> >         schema API).
>>>>>> >         Row.as(Class<T>) sounds good but then, does it mean we'll get
>>>>>> >         beam-sdk-java-row-jsonobject like modules (I'm not against, just
>>>>>> >         trying to understand here)?
>>>>>> >         If so, how can an IO use as() with the type it expects? Doesn't
>>>>>> >         it lead to having tons of these modules in the end?
>>>>>> >
>>>>>> >         Romain Manni-Bucau
>>>>>> >
>>>>>> >
>>>>>> >         On Wed, May 23, 2018 at 04:57, Reuven Lax <relax@google.com
>>>>>> >         <ma...@google.com>> wrote:
>>>>>> >
>>>>>> >             By the way Romain, if you have specific scenarios in mind I
>>>>>> >             would love to hear them. I can try and guess what exactly
>>>>>> >             you would like to get out of schemas, but it would work
>>>>>> >             better if you gave me concrete scenarios that you would like
>>>>>> >             to work.
>>>>>> >
>>>>>> >             Reuven
>>>>>> >
>>>>>> >             On Tue, May 22, 2018 at 7:45 PM Reuven Lax <relax@google.com
>>>>>> >             <ma...@google.com>> wrote:
>>>>>> >
>>>>>> >                 Yeah, what I'm working on will help with IO. Basically
>>>>>> >                 if you register a function with SchemaRegistry that
>>>>>> >                 converts back and forth between a type (say JsonObject)
>>>>>> >                 and a Beam Row, then it is applied by the framework
>>>>>> >                 behind the scenes as part of DoFn invocation. Concrete
>>>>>> >                 example: let's say I have an IO that reads json objects
>>>>>> >                   class MyJsonIORead extends PTransform<PBegin,
>>>>>> >                       PCollection<JsonObject>> {...}
>>>>>> >
>>>>>> >                 If you register a schema for this type (or you can also
>>>>>> >                 just set the schema directly on the output PCollection),
>>>>>> >                 then Beam knows how to convert back and forth between
>>>>>> >                 JsonObject and Row. So the next ParDo can look like
>>>>>> >
>>>>>> >                 p.apply(new MyJsonIORead())
>>>>>> >                 .apply(ParDo.of(new DoFn<JsonObject, T>() {
>>>>>> >                     @ProcessElement
>>>>>> >                     public void process(@Element Row row) {
>>>>>> >                     }
>>>>>> >                 }));
>>>>>> >
>>>>>> >                 And Beam will automatically convert JsonObject to a Row
>>>>>> >                 for processing (you aren't forced to do this of course -
>>>>>> >                 you can always ask for it as a JsonObject).
>>>>>> >
>>>>>> >                 The same is true for output. If you have a sink that
>>>>>> >                 takes in JsonObject but the transform before it produces
>>>>>> >                 Row objects (for instance - because the transform before
>>>>>> >                 it is Beam SQL), Beam can automatically convert Row back
>>>>>> >                 to JsonObject for you.
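The registration idea described in the quoted message above (a pair of functions converting a user type to and from a row, looked up by type) can be sketched with the standard library only. Everything here is a toy under assumptions: ToyRow, ToySchemaRegistry and ToyUser are hypothetical names, and this is not the Beam SchemaRegistry API, whose actual signatures are not shown in this thread.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

class SchemaRegistryDemo {
    public static void main(String[] args) {
        ToySchemaRegistry registry = new ToySchemaRegistry();
        registry.register(
            ToyUser.class,
            u -> new ToyRow().with("name", u.name).with("age", u.age),
            r -> new ToyUser((String) r.get("name"), (Integer) r.get("age")));

        // The framework side would call these behind the scenes around a DoFn.
        ToyRow row = registry.toRow(ToyUser.class, new ToyUser("ada", 36));
        ToyUser back = registry.fromRow(ToyUser.class, row);
        System.out.println(row.fields + " -> " + back.name + "/" + back.age);
    }
}

// Toy stand-in for a row: an ordered map from field name to value.
final class ToyRow {
    final Map<String, Object> fields = new LinkedHashMap<>();

    ToyRow with(String name, Object value) {
        fields.put(name, value);
        return this;
    }

    Object get(String name) {
        return fields.get(name);
    }
}

// Example user type that has no knowledge of rows.
final class ToyUser {
    final String name;
    final int age;

    ToyUser(String name, int age) {
        this.name = name;
        this.age = age;
    }
}

// Hypothetical registry holding one to-row/from-row pair per registered type.
final class ToySchemaRegistry {
    private static final class Conversion<T> {
        final Function<T, ToyRow> toRow;
        final Function<ToyRow, T> fromRow;

        Conversion(Function<T, ToyRow> toRow, Function<ToyRow, T> fromRow) {
            this.toRow = toRow;
            this.fromRow = fromRow;
        }
    }

    private final Map<Class<?>, Conversion<?>> conversions = new HashMap<>();

    <T> void register(Class<T> type, Function<T, ToyRow> toRow, Function<ToyRow, T> fromRow) {
        conversions.put(type, new Conversion<>(toRow, fromRow));
    }

    @SuppressWarnings("unchecked")
    private <T> Conversion<T> lookup(Class<T> type) {
        Conversion<T> conversion = (Conversion<T>) conversions.get(type);
        if (conversion == null) {
            throw new IllegalStateException("No conversion registered for " + type.getName());
        }
        return conversion;
    }

    <T> ToyRow toRow(Class<T> type, T value) {
        return lookup(type).toRow.apply(value);
    }

    <T> T fromRow(Class<T> type, ToyRow row) {
        return lookup(type).fromRow.apply(row);
    }
}
```

Note that this sketch copies values in both directions; it does not attempt the zero-copy "Row as an interface over the underlying object" idea that the thread also discusses.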
>>>>>> >
>>>>>> >                 All of this was detailed in the Schema doc I shared a
>>>>>> >                 few months ago. There was a lot of discussion on that
>>>>>> >                 document from various parties, and some of this API is a
>>>>>> >                 result of that discussion. This is also working in the
>>>>>> >                 branch JB and I were working on, though not yet
>>>>>> >                 integrated back to master.
>>>>>> >
>>>>>> >                 I would like to actually go further and make Row an
>>>>>> >                 interface and provide a way to automatically put a Row
>>>>>> >                 interface on top of any other object (e.g. JsonObject,
>>>>>> >                 Pojo, etc.) This won't change the way the user writes
>>>>>> >                 code, but instead of Beam having to copy and convert at
>>>>>> >                 each stage (e.g. from JsonObject to Row) it simply will
>>>>>> >                 create a Row object that uses the JsonObject as its
>>>>>> >                 underlying storage.
>>>>>> >
>>>>>> >                 Reuven
>>>>>> >
>>>>>> >                 On Tue, May 22, 2018 at 11:37 AM Romain Manni-Bucau
>>>>>> >                 <rmannibucau@gmail.com <mailto:rmannibucau@gmail.com>> wrote:
>>>>>> >
>>>>>> >                     Well, Beam can implement a new mapper but it doesn't
>>>>>> >                     help for IO. Most modern backends will take JSON
>>>>>> >                     directly, even the javax one, and it must stay generic.
>>>>>> >
>>>>>> >                     Then, since JSON to POJO mapping has already been done
>>>>>> >                     a dozen times, I'm not sure it is worth it for now.
>>>>>> >
>>>>>> >                     On Tue, May 22, 2018 at 20:27, Reuven Lax
>>>>>> >                     <relax@google.com <ma...@google.com>> wrote:
>>>>>> >
>>>>>> >                         We can do even better, btw: build a SchemaRegistry
>>>>>> >                         where automatic conversions can be registered
>>>>>> >                         between schema and Java data types. With this the
>>>>>> >                         user won't even need a DoFn to do the conversion.
>>>>>> >
>>>>>> >                         On Tue, May 22, 2018, 10:13 AM Romain Manni-Bucau
>>>>>> >                         <rmannibucau@gmail.com <ma...@gmail.com>> wrote:
>>>>>> >
>>>>>> >                             Hi guys,
>>>>>> >
>>>>>> >                             Checked out what has been done on schema model and
>>>>>> >                             think it is acceptable - regarding the json debate -
>>>>>> >                             if https://issues.apache.org/jira/browse/BEAM-4381
>>>>>> >                             can be fixed.
>>>>>> >
>>>>>> >                             High level, it is about providing a mainstream and
>>>>>> >                             not too impacting model OOTB and JSON seems the most
>>>>>> >                             valid option for now, at least for IO and some user
>>>>>> >                             transforms.
>>>>>> >
>>>>>> >                             Wdyt?
>>>>>> >
>>>>>> >                             On Fri, Apr 27, 2018 at 18:36, Romain Manni-Bucau
>>>>>> >                             <rmannibucau@gmail.com <ma...@gmail.com>> wrote:
>>>>>> >
>>>>>> >                                 Can give it a try end of May, sure (holidays and
>>>>>> >                                 work constraints will make it hard before).
>>>>>> >
>>>>>> >                                 On Apr 27, 2018 at 18:26, "Anton Kedin"
>>>>>> >                                 <kedin@google.com <ma...@google.com>> wrote:
>>>>>> >
>>>>>> >                                     Romain,
>>>>>> >
>>>>>> >                                     I don't believe that JSON approach was investigated
>>>>>> >                                     very thoroughly. I mentioned few reasons which will
>>>>>> >                                     make it not the best choice in my opinion, but I may
>>>>>> >                                     be wrong. Can you put together a design doc or a
>>>>>> >                                     prototype?
>>>>>> >
>>>>>> >                                     Thank you,
>>>>>> >                                     Anton
>>>>>> >
>>>>>> >
>>>>>> >                                     On Thu, Apr 26, 2018 at 10:17 PM Romain Manni-Bucau
>>>>>> >                                     <rmannibucau@gmail.com <ma...@gmail.com>> wrote:
>>>>>> >
>>>>>> >
>>>>>> >
>>>>>> >                                         On Apr 26, 2018 at 23:13, "Anton Kedin"
>>>>>> >                                         <kedin@google.com <ma...@google.com>> wrote:
>>>>>> >
>>>>>> >                                             BeamRecord (Row) has very little in common with
>>>>>> >                                             JsonObject (I assume you're talking about javax.json),
>>>>>> >                                             except maybe some similarities of the API. Few
>>>>>> >                                             reasons why JsonObject doesn't work:
>>>>>> >
>>>>>> >                                               * it is a Java EE API:
>>>>>> >                                                   o Beam SDK is not limited to Java. There are
>>>>>> >                                                     probably similar APIs for other languages but
>>>>>> >                                                     they might not necessarily carry the same
>>>>>> >                                                     semantics / APIs;
>>>>>> >
>>>>>> >                                         Not a big deal I think. At least not a technical blocker.
>>>>>> >
>>>>>> >                                                   o It can change between Java versions;
>>>>>> >
>>>>>> >                                         No, this is javaee ;).
>>>>>> >
>>>>>> >                                                   o Current Beam java implementation is an
>>>>>> >                                                     experimental feature to identify what's needed
>>>>>> >                                                     from such API, in the end we might end up with
>>>>>> >                                                     something similar to JsonObject API, but likely
>>>>>> >                                                     not;
>>>>>> >
>>>>>> >                                         I don't get that point as a blocker
>>>>>> >
>>>>>> >                                               * represents JSON, which is not an API but an
>>>>>> >                                                 object notation:
>>>>>> >                                                   o it is defined as unicode string in a certain
>>>>>> >                                                     format. If you choose to adhere to ECMA-404,
>>>>>> >                                                     then it doesn't sound like JsonObject can
>>>>>> >                                                     represent an Avro object, if I'm reading it
>>>>>> >                                                     right;
>>>>>> >
>>>>>> >                                         It is in the generator impl, you can impl an avrogenerator.
>>>>>> >
>>>>>> >                                               * doesn't define a type system (JSON does, but
>>>>>> >                                                 it's lacking):
>>>>>> >                                                   o for example, JSON doesn't define semantics
>>>>>> >                                                     for numbers;
>>>>>> >                                                   o doesn't define date/time types;
>>>>>> >                                                   o doesn't allow extending JSON type system at all;
>>>>>> >
>>>>>> >                                         That is why you need a metadata object, or simpler, a
>>>>>> >                                         schema with that data. Json or beam record doesn't help
>>>>>> >                                         here and you end up on the same outcome if you think
>>>>>> >                                         about it.
>>>>>> >
>>>>>> >                                               * lacks schemas;
>>>>>> >
>>>>>> >                                         Jsonschema are standard, widely spread and tooled
>>>>>> >                                         compared to alternative.
>>>>>> >
>>>>>> >                                             You can definitely try loosen the requirements and
>>>>>> >                                             define everything in JSON in userland, but the point of
>>>>>> >                                             Row/Schema is to avoid it and define everything in Beam
>>>>>> >                                             model, which can be extended, mapped to JSON, Avro,
>>>>>> >                                             BigQuery Schemas, custom binary format etc., with same
>>>>>> >                                             semantics across beam SDKs.
>>>>>> >
>>>>>> >
>>>>>> >                                         This is what jsonp would allow with the benefit of a
>>>>>> >                                         natural pojo support through jsonb.
>>>>>> >
>>>>>> >
>>>>>> >
>>>>>> >                                             On Thu, Apr 26, 2018 at 12:28 PM Romain Manni-Bucau
>>>>>> >                                             <rmannibucau@gmail.com <mailto:rmannibucau@gmail.com>> wrote:
>>>>>> >
>>>>>> >                                                 Just to let it be clear and let me understand: how is
>>>>>> >                                                 BeamRecord different from a JsonObject which is an API
>>>>>> >                                                 without implementation (not even a json one OOTB)?
>>>>>> >                                                 Advantage of json *api* are indeed natural mapping
>>>>>> >                                                 (jsonb is based on jsonp so no new binding to reinvent)
>>>>>> >                                                 and simple serialization (json+gzip for ex, or avro if
>>>>>> >                                                 you want to be geeky).
>>>>>> >
>>>>>> >                                                 I fail to see the point to rebuild an ecosystem ATM.
>>>>>> >
>>>>>> >                                                 On Apr 26, 2018 at 19:12, "Reuven Lax"
>>>>>> >                                                 <relax@google.com <mailto:relax@google.com>> wrote:
>>>>>> >
>>>>>> >                                                     Exactly what JB said. We will write a generic
>>>>>> >                                                     conversion from Avro (or json) to Beam schemas,
>>>>>> >                                                     which will make them work transparently with SQL.
>>>>>> >                                                     The plan is also to migrate Anton's work so that
>>>>>> >                                                     POJOs work generically for any schema.
>>>>>> >
>>>>>> >                                                     Reuven
>>>>>> >
>>>>>> >                                                     On Thu, Apr 26, 2018 at 1:17 AM Jean-Baptiste Onofré
>>>>>> >                                                     <jb@nanthrax.net <mailto:jb@nanthrax.net>> wrote:
>>>>>> >
>>>>>> >                                                         For now we have a generic schema interface.
>>>>>> >                                                         Json-b can be an impl, avro could be another one.
>>>>>> >
>>>>>> >                                                         Regards
>>>>>> >                                                         JB
>>>>>> >                                                         On Apr 26, 2018, at 12:08, Romain Manni-Bucau
>>>>>> >                                                         <rmannibucau@gmail.com <mailto:rmannibucau@gmail.com>> wrote:
>>>>>> >
>>>>>> >                                                             Hmm,
>>>>>> >
>>>>>> >                                                             Avro still has the pitfall of an uncontrolled stack
>>>>>> >                                                             which brings way too many dependencies to be part of
>>>>>> >                                                             any API; this is why I proposed a JSON-P based API
>>>>>> >                                                             (JsonObject) with a custom beam entry for some
>>>>>> >                                                             metadata (headers "à la Camel").
>>>>>> >
>>>>>> >
>>>>>> >
>>>>>> >
>>>>>> >
>>>>>>  2018-04-26
>>>>>> >                                                             9:59
>>>>>> >
>>>>>>  GMT+02:00
>>>>>> >
>>>>>>  Jean-Baptiste Onofré
>>>>>> >                                                             <
>>>>>> jb@nanthrax.net
>>>>>> >                                                             <mailto:
>>>>>> jb@nanthrax.net>>:
>>>>>> >
>>>>>> >
>>>>>> >                                                                 Hi
>>>>>> Ismael
>>>>>> >
>>>>>> >                                                                 You
>>>>>> mean
>>>>>> >
>>>>>>  directly
>>>>>> >                                                                 in
>>>>>> Beam
>>>>>> >                                                                 SQL
>>>>>> ?
>>>>>> >
>>>>>> >                                                                 That
>>>>>> >
>>>>>>  will be
>>>>>> >
>>>>>>  part of
>>>>>> >
>>>>>>  schema
>>>>>> >
>>>>>>  support:
>>>>>> >
>>>>>>  generic
>>>>>> >
>>>>>>  record
>>>>>> >
>>>>>>  could be
>>>>>> >                                                                 one
>>>>>> of
>>>>>> >                                                                 the
>>>>>> >
>>>>>>  payload
>>>>>> >                                                                 with
>>>>>> >
>>>>>>  across
>>>>>> >
>>>>>>  schema.
>>>>>> >
>>>>>> >
>>>>>>  Regards
>>>>>> >                                                                 JB
>>>>>> >                                                                 Le
>>>>>> 26
>>>>>> >                                                                 avr.
>>>>>> >
>>>>>>  2018, à
>>>>>> >
>>>>>>  11:39,
>>>>>> >
>>>>>>  "Ismaël
>>>>>> >
>>>>>>  Mejía" <
>>>>>> >
>>>>>> iemejia@gmail.com
>>>>>> >
>>>>>>  <ma...@gmail.com>>
>>>>>> >                                                                 a
>>>>>> écrit:
>>>>>> >
>>>>>> >
>>>>>>  Hello Anton,
>>>>>> >
>>>>>> >
>>>>>>  Thanks for the descriptive email and the really useful work. Any plans
>>>>>> >
>>>>>>  to tackle PCollections of GenericRecord/IndexedRecords? it seems Avro
>>>>>> >
>>>>>>  is a natural fit for this approach too.
>>>>>> >
>>>>>> >
>>>>>>  Regards,
>>>>>> >
>>>>>>  Ismaël
>>>>>> >
>>>>>> >
>>>>>>  On Wed, Apr 25, 2018 at 9:04 PM, Anton Kedin <kedin@google.com
>>>>>> >
>>>>>>  <ma...@google.com>> wrote:
>>>>>> >
>>>>>> >
>>>>>>
>>>>>> >
>>>>>> >
>>>>>>      Hi,
>>>>>> >
>>>>>> >
>>>>>>      I want
>>>>>> >
>>>>>>      to
>>>>>> >
>>>>>>      highlight
>>>>>> >
>>>>>>      a couple
>>>>>> >
>>>>>>      of
>>>>>> >
>>>>>>      improvements
>>>>>> >
>>>>>>      to
>>>>>> >
>>>>>>      Beam
>>>>>> >
>>>>>>      SQL
>>>>>> >
>>>>>>      we
>>>>>> >
>>>>>>      have
>>>>>> >
>>>>>>      been
>>>>>> >
>>>>>> >
>>>>>>      working
>>>>>> >
>>>>>>      on
>>>>>> >
>>>>>>      recently
>>>>>> >
>>>>>>      which
>>>>>> >
>>>>>>      are
>>>>>> >
>>>>>>      targeted
>>>>>> >
>>>>>>      to
>>>>>> >
>>>>>>      make
>>>>>> >
>>>>>>      Beam
>>>>>> >
>>>>>>      SQL
>>>>>> >
>>>>>>      API
>>>>>> >
>>>>>>      easier
>>>>>> >
>>>>>>      to
>>>>>> >
>>>>>>      use.
>>>>>> >
>>>>>> >
>>>>>>      Specifically
>>>>>> >
>>>>>>      these
>>>>>> >
>>>>>>      features
>>>>>> >
>>>>>>      simplify
>>>>>> >
>>>>>>      conversion
>>>>>> >
>>>>>>      of
>>>>>> >
>>>>>>      Java
>>>>>> >
>>>>>>      Beans
>>>>>> >
>>>>>>      and
>>>>>> >
>>>>>>      JSON
>>>>>> >
>>>>>> >
>>>>>>      strings
>>>>>> >
>>>>>>      to
>>>>>> >
>>>>>>      Rows.
>>>>>> >
>>>>>> >
>>>>>> >
>>>>>>      Feel
>>>>>> >
>>>>>>      free
>>>>>> >
>>>>>>      to
>>>>>> >
>>>>>>      try
>>>>>> >
>>>>>>      this
>>>>>> >
>>>>>>      and
>>>>>> >
>>>>>>      send
>>>>>> >
>>>>>>      any
>>>>>> >
>>>>>>      bugs/comments/PRs
>>>>>> >
>>>>>>      my
>>>>>> >
>>>>>>      way.
>>>>>> >
>>>>>> >
>>>>>> >
>>>>>>      **Caveat:
>>>>>> >
>>>>>>      this
>>>>>> >
>>>>>>      is
>>>>>> >
>>>>>>      still
>>>>>> >
>>>>>>      work
>>>>>> >
>>>>>>      in
>>>>>> >
>>>>>>      progress,
>>>>>> >
>>>>>>      and
>>>>>> >
>>>>>>      has
>>>>>> >
>>>>>>      known
>>>>>> >
>>>>>>      bugs
>>>>>> >
>>>>>>      and
>>>>>> >
>>>>>>      incomplete
>>>>>> >
>>>>>> >
>>>>>>      features,
>>>>>> >
>>>>>>      see
>>>>>> >
>>>>>>      below
>>>>>> >
>>>>>>      for
>>>>>> >
>>>>>>      details.**
>>>>>> >
>>>>>> >
>>>>>> >
>>>>>>      Background
>>>>>> >
>>>>>> >
>>>>>> >
>>>>>>      Beam
>>>>>> >
>>>>>>      SQL
>>>>>> >
>>>>>>      queries
>>>>>> >
>>>>>>      can
>>>>>> >
>>>>>>      only
>>>>>> >
>>>>>>      be
>>>>>> >
>>>>>>      applied
>>>>>> >
>>>>>>      to
>>>>>> >
>>>>>>      PCollection<Row>.
>>>>>> >
>>>>>>      This
>>>>>> >
>>>>>>      means
>>>>>> >
>>>>>>      that
>>>>>> >
>>>>>> >
>>>>>>      users
>>>>>> >
>>>>>>      need
>>>>>> >
>>>>>>      to
>>>>>> >
>>>>>>      convert
>>>>>> >
>>>>>>      whatever
>>>>>> >
>>>>>>      PCollection
>>>>>> >
>>>>>>      elements
>>>>>> >
>>>>>>      they
>>>>>> >
>>>>>>      have
>>>>>> >
>>>>>>      to
>>>>>> >
>>>>>>      Rows
>>>>>> >
>>>>>>      before
>>>>>> >
>>>>>> >
>>>>>>      querying
>>>>>> >
>>>>>>      them
>>>>>> >
>>>>>>      with
>>>>>> >
>>>>>>      SQL.
>>>>>> >
>>>>>>      This
>>>>>> >
>>>>>>      usually
>>>>>> >
>>>>>>      requires
>>>>>> >
>>>>>>
>>>>>
>>>>>

Re: Beam SQL Improvements

Posted by Reuven Lax <re...@google.com>.
Does DirectRunner do this today?

On Mon, Jun 4, 2018 at 9:10 PM Lukasz Cwik <lc...@google.com> wrote:

> Shouldn't the runner isolate each instance of the pipeline behind an
> appropriate class loader?
>
> On Sun, Jun 3, 2018 at 12:45 PM Reuven Lax <re...@google.com> wrote:
>
>> Just an update: Romain and I chatted on Slack, and I think I understand
>> his concern. The concern wasn't specifically about schemas, rather about
>> having a generic way to register per-ParDo state that has worker lifetime.
>> As evidence that this is needed, static variables are often used to
>> simulate it. Static variables, however, have downsides: if two pipelines
>> are run on the same JVM (happens often with unit tests, and there's nothing
>> that prevents a runner from doing so in a production environment), these
>> static variables will interfere with each other.
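A minimal sketch of the interference described above, in plain Java with no Beam dependency. `StaticCacheFn` and `ScopedCacheFn` are hypothetical stand-ins for a DoFn, and the pipeline id used as a key is an assumption made for illustration, not an existing Beam API:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for a DoFn that keeps "worker lifetime" state in a static field.
class StaticCacheFn {
    // Shared by every instance in the JVM, whichever pipeline created it.
    static final Map<String, String> CACHE = new ConcurrentHashMap<>();

    String lookup(String key, String valueIfAbsent) {
        return CACHE.computeIfAbsent(key, k -> valueIfAbsent);
    }
}

// Same idea, but the state is keyed by an (assumed) pipeline identifier.
class ScopedCacheFn {
    static final Map<String, Map<String, String>> CACHES = new ConcurrentHashMap<>();
    private final String pipelineId;

    ScopedCacheFn(String pipelineId) {
        this.pipelineId = pipelineId;
    }

    String lookup(String key, String valueIfAbsent) {
        return CACHES.computeIfAbsent(pipelineId, id -> new ConcurrentHashMap<>())
                     .computeIfAbsent(key, k -> valueIfAbsent);
    }
}

class StaticStateDemo {
    public static void main(String[] args) {
        // Two "pipelines" in one JVM: the second silently sees the first one's value.
        System.out.println(new StaticCacheFn().lookup("endpoint", "pipeline-A-config"));
        System.out.println(new StaticCacheFn().lookup("endpoint", "pipeline-B-config"));

        // With a per-pipeline key, each pipeline keeps its own value.
        System.out.println(new ScopedCacheFn("A").lookup("endpoint", "pipeline-A-config"));
        System.out.println(new ScopedCacheFn("B").lookup("endpoint", "pipeline-B-config"));
    }
}
```

Keying by pipeline only limits the interference. It leaves open when the per-pipeline entry gets released, which is the lifecycle question raised further down in this thread.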
>>
>> On Thu, May 24, 2018 at 12:30 AM Reuven Lax <re...@google.com> wrote:
>>
>>> Romain, maybe it would be useful for us to find some time on slack. I'd
>>> like to understand your concerns. Also keep in mind that I'm tagging all
>>> these classes as Experimental for now, so we can definitely change these
>>> interfaces around if we decide they are not the best ones.
>>>
>>> Reuven
>>>
>>> On Tue, May 22, 2018 at 11:35 PM Romain Manni-Bucau <
>>> rmannibucau@gmail.com> wrote:
>>>
>>>> Why not extend ProcessContext to add the new remapped output? Otherwise it
>>>> looks good. (The part I don't like is that creating a new context each time
>>>> a new feature is added hurts users. What happens when Beam adds some
>>>> reactive support? A ReactiveOutputReceiver?)
>>>>
>>>> The pipeline sounds like the wrong storage: once distributed, the instances
>>>> are serialized, which breaks the lifecycle of the original instance and
>>>> leaves no real release/close hook on them anymore, right? I'm not sure we
>>>> can do better than DoFn/source embedded instances today.
>>>>
>>>>
>>>>
>>>>
>>>> Le mer. 23 mai 2018 08:02, Romain Manni-Bucau <rm...@gmail.com>
>>>> a écrit :
>>>>
>>>>>
>>>>>
>>>>> Le mer. 23 mai 2018 07:55, Jean-Baptiste Onofré <jb...@nanthrax.net> a
>>>>> écrit :
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> IMHO, it would be better to have an explicit transform/IO as converter.
>>>>>>
>>>>>> It would be easier for users.
>>>>>>
>>>>>> Another option would be to use a "TypeConverter/SchemaConverter" map, as
>>>>>> we do in Camel: Beam could check the source/destination "type" and look
>>>>>> in the map for an available converter. This map can be stored as
>>>>>> part of the pipeline (as we do for filesystem registration).
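A rough sketch of such a converter map in plain Java, assuming converters are keyed by an exact (source, destination) class pair. `ConverterRegistry` and its methods are hypothetical names for illustration and are neither Camel's nor Beam's API:

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical registry: source class -> destination class -> conversion function.
class ConverterRegistry {
    private final Map<Class<?>, Map<Class<?>, Function<Object, Object>>> converters = new HashMap<>();

    <S, T> void register(Class<S> from, Class<T> to, Function<S, T> fn) {
        converters.computeIfAbsent(from, k -> new HashMap<>())
                  .put(to, value -> fn.apply(from.cast(value)));
    }

    // Lets a caller verify up front that a conversion exists.
    boolean canConvert(Class<?> from, Class<?> to) {
        return converters.getOrDefault(from, Collections.emptyMap()).containsKey(to);
    }

    // Looks up by the exact runtime class of the value; subclasses are not considered.
    <T> T convert(Object value, Class<T> to) {
        Function<Object, Object> fn =
            converters.getOrDefault(value.getClass(), Collections.emptyMap()).get(to);
        if (fn == null) {
            throw new IllegalStateException(
                "No converter from " + value.getClass().getName() + " to " + to.getName());
        }
        return to.cast(fn.apply(value));
    }
}

class ConverterRegistryDemo {
    public static void main(String[] args) {
        ConverterRegistry registry = new ConverterRegistry();
        registry.register(String.class, Integer.class, s -> Integer.parseInt(s));
        System.out.println(registry.canConvert(String.class, Integer.class));
        System.out.println(registry.convert("42", Integer.class));
    }
}
```

A check like `canConvert` could run while the pipeline is being built, so that a missing converter fails before any data is processed.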
>>>>>>
>>>>>
>>>>>
>>>>> It works in Camel because Camel is not strongly typed, right? So it may
>>>>> require a new Beam pipeline API.
>>>>>
>>>>> +1 for the explicit transform. If it were added to the pipeline API the
>>>>> way coders are, it wouldn't break the fluent API:
>>>>>
>>>>> p.apply(io).setOutputType(Foo.class)
>>>>>
>>>>> Coders can be a workaround since they own the type, but since the
>>>>> PCollection is the real owner it is surely saner this way, no?
>>>>>
>>>>> It also probably needs to ensure all converters are present before running
>>>>> the pipeline. Having no implicit environment converter support is probably
>>>>> a good way to start, to avoid late surprises.
>>>>>
>>>>>
>>>>>
>>>>>> My $0.01
>>>>>>
>>>>>> Regards
>>>>>> JB
>>>>>>
>>>>>> On 23/05/2018 07:51, Romain Manni-Bucau wrote:
>>>>>> > How does it work on the pipeline side?
>>>>>> > Do you generate these "virtual" IOs at build time so that the fluent
>>>>>> > API works without erasing generics?
>>>>>> >
>>>>>> > ex: SQL(row)->BigQuery(native) will not compile so we need a
>>>>>> > SQL(row)->BigQuery(row)
>>>>>> >
>>>>>> > Side note unrelated to Row: if you add another registry, a preliminary
>>>>>> > task may be to ensure Beam has a kind of singleton/context, to avoid
>>>>>> > duplicating it or not tracking it properly. These kinds of converters
>>>>>> > will in general need a global close and not only a per-record one:
>>>>>> > converter.init();converter.convert(row);....converter.destroy();,
>>>>>> > otherwise it easily leaks. This is why it can require some way to not
>>>>>> > recreate it. A quick fix, if you are in ByteBuddy already, can probably
>>>>>> > be to add it to setup/teardown; being more global would be nicer but is
>>>>>> > more challenging.
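The init/convert/destroy sequence above can be sketched in plain Java with `AutoCloseable`. `TrackingConverter` is a hypothetical converter that only records its lifecycle events so the order is visible; it is not a Beam class, and how a runner would actually scope it (per bundle, per DoFn instance, per worker) is left open here:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical converter with an explicit lifecycle: created once, used many times, closed once.
class TrackingConverter implements AutoCloseable {
    final List<String> events = new ArrayList<>();

    TrackingConverter() {
        events.add("init");
    }

    String convert(String row) {
        events.add("convert");
        return row.toUpperCase();
    }

    @Override
    public void close() {
        events.add("destroy");
    }
}

class ConverterLifecycleDemo {
    // One converter instance for the whole batch of rows instead of one per record.
    static List<String> convertBundle(List<String> rows) {
        List<String> events;
        try (TrackingConverter converter = new TrackingConverter()) {
            events = converter.events;
            for (String row : rows) {
                converter.convert(row);
            }
        }
        // close() has run by now, so "destroy" is the last recorded event.
        return events;
    }

    public static void main(String[] args) {
        System.out.println(convertBundle(Arrays.asList("a", "b", "c")));
    }
}
```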
>>>>>> >
>>>>>> > Romain Manni-Bucau
>>>>>> > @rmannibucau <https://twitter.com/rmannibucau> |  Blog
>>>>>> > <https://rmannibucau.metawerx.net/> | Old Blog
>>>>>> > <http://rmannibucau.wordpress.com> | Github
>>>>>> > <https://github.com/rmannibucau> | LinkedIn
>>>>>> > <https://www.linkedin.com/in/rmannibucau> | Book
>>>>>> > <
>>>>>> https://www.packtpub.com/application-development/java-ee-8-high-performance
>>>>>> >
>>>>>> >
>>>>>> >
>>>>>> > Le mer. 23 mai 2018 à 07:22, Reuven Lax <relax@google.com
>>>>>> > <ma...@google.com>> a écrit :
>>>>>> >
>>>>>> >     No - the only modules we need to add to core are the ones we choose
>>>>>> >     to add. For example, I will probably add a registration for
>>>>>> >     TableRow/TableSchema (GCP BigQuery) so these can work seamlessly
>>>>>> >     with schemas. However I will add that to the GCP module, so only
>>>>>> >     someone depending on that module needs to pull in that dependency.
>>>>>> >     The Java ServiceLoader framework can be used by these modules to
>>>>>> >     register schemas for their types (we already do something similar
>>>>>> >     for FileSystem and for coders as well).
>>>>>> >
>>>>>> >     BTW, right now I'm doing the conversion back and forth between Row
>>>>>> >     objects in the ByteBuddy-generated bytecode that we generate in
>>>>>> >     order to invoke DoFns.
>>>>>> >
>>>>>> >     Reuven
>>>>>> >
>>>>>> >     On Tue, May 22, 2018 at 10:04 PM Romain Manni-Bucau
>>>>>> >     <rmannibucau@gmail.com <ma...@gmail.com>> wrote:
>>>>>> >
>>>>>> >         Hmm, the pluggability part is close to what I wanted to do with
>>>>>> >         JsonObject as a main API (to avoid redoing a "row" API and
>>>>>> >         schema API).
>>>>>> >         Row.as(Class<T>) sounds good but then, does it mean we'll get
>>>>>> >         beam-sdk-java-row-jsonobject like modules (I'm not against it,
>>>>>> >         just trying to understand here)?
>>>>>> >         If so, how can an IO use as() with the type it expects? Doesn't
>>>>>> >         it lead to having tons of these modules in the end?
>>>>>> >
>>>>>> >         Romain Manni-Bucau
>>>>>> >         @rmannibucau <https://twitter.com/rmannibucau> |  Blog
>>>>>> >         <https://rmannibucau.metawerx.net/> | Old Blog
>>>>>> >         <http://rmannibucau.wordpress.com> | Github
>>>>>> >         <https://github.com/rmannibucau> | LinkedIn
>>>>>> >         <https://www.linkedin.com/in/rmannibucau> | Book
>>>>>> >         <
>>>>>> https://www.packtpub.com/application-development/java-ee-8-high-performance
>>>>>> >
>>>>>> >
>>>>>> >
>>>>>> >         Le mer. 23 mai 2018 à 04:57, Reuven Lax <relax@google.com
>>>>>> >         <ma...@google.com>> a écrit :
>>>>>> >
>>>>>> >             By the way Romain, if you have specific scenarios in mind I
>>>>>> >             would love to hear them. I can try and guess what exactly
>>>>>> >             you would like to get out of schemas, but it would work
>>>>>> >             better if you gave me concrete scenarios that you would like
>>>>>> >             to work.
>>>>>> >
>>>>>> >             Reuven
>>>>>> >
>>>>>> >             On Tue, May 22, 2018 at 7:45 PM Reuven Lax <
>>>>>> relax@google.com
>>>>>> >             <ma...@google.com>> wrote:
>>>>>> >
>>>>>> >                 Yeah, what I'm working on will help with IO. Basically, if you
>>>>>> >                 register a function with SchemaRegistry that converts back and
>>>>>> >                 forth between a type (say JsonObject) and a Beam Row, then it is
>>>>>> >                 applied by the framework behind the scenes as part of DoFn
>>>>>> >                 invocation. Concrete example: let's say I have an IO that reads
>>>>>> >                 JSON objects:
>>>>>> >
>>>>>> >                   class MyJsonIORead extends PTransform<PBegin,
>>>>>> >                       PCollection<JsonObject>> {...}
>>>>>> >
>>>>>> >                 If you register a schema for this type (or you can also just set
>>>>>> >                 the schema directly on the output PCollection), then Beam knows
>>>>>> >                 how to convert back and forth between JsonObject and Row. So the
>>>>>> >                 next ParDo can look like
>>>>>> >
>>>>>> >                 p.apply(new MyJsonIORead())
>>>>>> >                  .apply(ParDo.of(new DoFn<JsonObject, T>() {
>>>>>> >                      @ProcessElement public void process(@Element Row row) {...}
>>>>>> >                  }));
>>>>>> >
>>>>>> >                 And Beam will automatically convert JsonObject to a Row for
>>>>>> >                 processing (you aren't forced to do this of course - you can
>>>>>> >                 always ask for it as a JsonObject).
>>>>>> >
>>>>>> >                 The same is true for output. If you have a sink that takes in
>>>>>> >                 JsonObject but the transform before it produces Row objects (for
>>>>>> >                 instance, because the transform before it is Beam SQL), Beam can
>>>>>> >                 automatically convert Row back to JsonObject for you.
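To make the registration idea concrete, here is a simplified sketch in plain Java. It is not Beam's SchemaRegistry: `SchemaRegistrySketch`, its methods, the `Point` type, and the use of `Map<String, Object>` as a stand-in for Row are all assumptions made for illustration. It only shows the shape of the idea, a pair of to-row and from-row functions registered per type and looked up by a framework on the user's behalf:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical registry holding a to-row and a from-row function per Java type.
class SchemaRegistrySketch {
    private static final class Conversion<T> {
        final Function<T, Map<String, Object>> toRow;
        final Function<Map<String, Object>, T> fromRow;

        Conversion(Function<T, Map<String, Object>> toRow,
                   Function<Map<String, Object>, T> fromRow) {
            this.toRow = toRow;
            this.fromRow = fromRow;
        }
    }

    private final Map<Class<?>, Conversion<?>> conversions = new HashMap<>();

    <T> void register(Class<T> type,
                      Function<T, Map<String, Object>> toRow,
                      Function<Map<String, Object>, T> fromRow) {
        conversions.put(type, new Conversion<>(toRow, fromRow));
    }

    @SuppressWarnings("unchecked")
    private <T> Conversion<T> lookup(Class<T> type) {
        // Safe by construction: register() always stores a Conversion<T> under Class<T>.
        Conversion<T> conversion = (Conversion<T>) conversions.get(type);
        if (conversion == null) {
            throw new IllegalStateException("No schema registered for " + type.getName());
        }
        return conversion;
    }

    // What a framework would call before handing the element to code that asked for a row.
    <T> Map<String, Object> toRow(Class<T> type, T element) {
        return lookup(type).toRow.apply(element);
    }

    // What a framework would call before handing a row to a sink that expects T.
    <T> T fromRow(Class<T> type, Map<String, Object> row) {
        return lookup(type).fromRow.apply(row);
    }
}

// Hypothetical user type standing in for JsonObject or a POJO.
final class Point {
    final int x;
    final int y;

    Point(int x, int y) {
        this.x = x;
        this.y = y;
    }
}

class SchemaRegistryDemo {
    static SchemaRegistrySketch registryWithPoint() {
        SchemaRegistrySketch registry = new SchemaRegistrySketch();
        registry.register(
            Point.class,
            p -> {
                Map<String, Object> row = new HashMap<>();
                row.put("x", p.x);
                row.put("y", p.y);
                return row;
            },
            row -> new Point((Integer) row.get("x"), (Integer) row.get("y")));
        return registry;
    }

    public static void main(String[] args) {
        SchemaRegistrySketch registry = registryWithPoint();
        Map<String, Object> row = registry.toRow(Point.class, new Point(1, 2));
        Point back = registry.fromRow(Point.class, row);
        System.out.println(row.get("x") + "," + row.get("y") + " -> " + back.x + "," + back.y);
    }
}
```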
>>>>>> >
>>>>>> >                 All of this was detailed in the Schema doc I shared a few months
>>>>>> >                 ago. There was a lot of discussion on that document from various
>>>>>> >                 parties, and some of this API is a result of that discussion.
>>>>>> >                 This is also working in the branch JB and I were working on,
>>>>>> >                 though not yet integrated back to master.
>>>>>> >
>>>>>> >                 I would like to actually go further and make Row an interface
>>>>>> >                 and provide a way to automatically put a Row interface on top of
>>>>>> >                 any other object (e.g. JsonObject, Pojo, etc.). This won't
>>>>>> >                 change the way the user writes code, but instead of Beam having
>>>>>> >                 to copy and convert at each stage (e.g. from JsonObject to Row)
>>>>>> >                 it will simply create a Row object that uses the JsonObject as
>>>>>> >                 its underlying storage.
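The "Row as an interface over existing storage" idea can be sketched as an adapter. `RowView` and `MapBackedRow` are hypothetical names, and a `Map` stands in for JsonObject so the example needs no extra dependency; this is not the Row API in the branch mentioned above:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical, heavily simplified stand-in for a Row interface.
interface RowView {
    Object getValue(String fieldName);

    int getFieldCount();
}

// Adapter that exposes an existing object as a RowView without copying its fields.
final class MapBackedRow implements RowView {
    private final Map<String, Object> storage;

    MapBackedRow(Map<String, Object> storage) {
        this.storage = storage; // keeps a reference, no copy
    }

    @Override
    public Object getValue(String fieldName) {
        if (!storage.containsKey(fieldName)) {
            throw new IllegalArgumentException("Unknown field: " + fieldName);
        }
        return storage.get(fieldName);
    }

    @Override
    public int getFieldCount() {
        return storage.size();
    }
}

class RowViewDemo {
    public static void main(String[] args) {
        Map<String, Object> json = new LinkedHashMap<>();
        json.put("name", "beam");
        RowView row = new MapBackedRow(json);
        System.out.println(row.getValue("name"));

        // Because the view reads through to the storage, later changes are visible.
        json.put("name", "sql");
        System.out.println(row.getValue("name"));
    }
}
```

The trade-off of a view is that it shares state with the wrapped object, so it is only safe where the underlying object is not mutated behind the framework's back.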
>>>>>> >
>>>>>> >                 Reuven
>>>>>> >
>>>>>> >                 On Tue, May 22, 2018 at 11:37 AM Romain Manni-Bucau
>>>>>> >                 <rmannibucau@gmail.com <mailto:
>>>>>> rmannibucau@gmail.com>>
>>>>>> >                 wrote:
>>>>>> >
>>>>>> >                     Well, Beam can implement a new mapper but it doesn't
>>>>>> >                     help for IO. Most modern backends will take JSON
>>>>>> >                     directly, even the javax one, and it must stay generic.
>>>>>> >
>>>>>> >                     Then since JSON to POJO mapping has already been done a
>>>>>> >                     dozen times, I'm not sure it is worth it for now.
>>>>>> >
>>>>>> >                     Le mar. 22 mai 2018 20:27, Reuven Lax
>>>>>> >                     <relax@google.com <ma...@google.com>> a
>>>>>> écrit :
>>>>>> >
>>>>>> >                         We can do even better btw: build a SchemaRegistry
>>>>>> >                         where automatic conversions can be registered
>>>>>> >                         between schema and Java data types. With this the
>>>>>> >                         user won't even need a DoFn to do the conversion.
>>>>>> >
>>>>>> >                         On Tue, May 22, 2018, 10:13 AM Romain
>>>>>> >                         Manni-Bucau <rmannibucau@gmail.com
>>>>>> >                         <ma...@gmail.com>> wrote:
>>>>>> >
>>>>>> >                             Hi guys,
>>>>>> >
>>>>>> >                             Hi guys,
>>>>>> >
>>>>>> >                             I checked out what has been done on the schema
>>>>>> >                             model and think it is acceptable, regarding the
>>>>>> >                             JSON debate, if
>>>>>> >                             https://issues.apache.org/jira/browse/BEAM-4381
>>>>>> >                             can be fixed.
>>>>>> >
>>>>>> >                             At a high level, it is about providing a
>>>>>> >                             mainstream and not too impacting model OOTB, and
>>>>>> >                             JSON seems the most valid option for now, at
>>>>>> >                             least for IO and some user transforms.
>>>>>> >
>>>>>> >                             Wdyt?
>>>>>> >
>>>>>> >                             Le ven. 27 avr. 2018 18:36, Romain
>>>>>> >                             Manni-Bucau <rmannibucau@gmail.com
>>>>>> >                             <ma...@gmail.com>> a
>>>>>> écrit :
>>>>>> >
>>>>>> >                                 I can give it a try at the end of May, sure
>>>>>> >                                 (holidays and work constraints will make it
>>>>>> >                                 hard before then).
>>>>>> >
>>>>>> >                                 Le 27 avr. 2018 18:26, "Anton Kedin"
>>>>>> >                                 <kedin@google.com
>>>>>> >                                 <ma...@google.com>> a
>>>>>> écrit :
>>>>>> >
>>>>>> >                                     Romain,
>>>>>> >
>>>>>> >                                     I don't believe that the JSON approach was
>>>>>> >                                     investigated very thoroughly. I mentioned a
>>>>>> >                                     few reasons which make it not the best
>>>>>> >                                     choice in my opinion, but I may be wrong.
>>>>>> >                                     Can you put together a design doc or a
>>>>>> >                                     prototype?
>>>>>> >
>>>>>> >                                     Thank you,
>>>>>> >                                     Anton
>>>>>> >
>>>>>> >
>>>>>> >                                     On Thu, Apr 26, 2018 at 10:17 PM
>>>>>> >                                     Romain Manni-Bucau
>>>>>> >                                     <rmannibucau@gmail.com
>>>>>> >                                     <ma...@gmail.com>>
>>>>>> wrote:
>>>>>> >
>>>>>> >
>>>>>> >
>>>>>> >                                         Le 26 avr. 2018 23:13,
>>>>>> "Anton
>>>>>> >                                         Kedin" <kedin@google.com
>>>>>> >                                         <ma...@google.com>>
>>>>>> a écrit :
>>>>>> >
>>>>>> >                                             BeamRecord (Row) has very little in common with JsonObject
>>>>>> >                                             (I assume you're talking about javax.json), except maybe
>>>>>> >                                             some similarities of the API. A few reasons why JsonObject
>>>>>> >                                             doesn't work:
>>>>>> >
>>>>>> >                                               * it is a Java EE API:
>>>>>> >                                                   o Beam SDK is not limited to Java. There are probably
>>>>>> >                                                     similar APIs for other languages but they might not
>>>>>> >                                                     necessarily carry the same semantics / APIs;
>>>>>> >
>>>>>> >
>>>>>> >                                         Not a big deal I think. At least not a technical blocker.
>>>>>> >
>>>>>> >                                                   o It can change between Java versions;
>>>>>> >
>>>>>> >                                         No, this is javaee ;).
>>>>>> >
>>>>>> >
>>>>>> >                                                   o Current Beam java implementation is an experimental
>>>>>> >                                                     feature to identify what's needed from such API, in
>>>>>> >                                                     the end we might end up with something similar to
>>>>>> >                                                     JsonObject API, but likely not;
>>>>>> >
>>>>>> >
>>>>>> >                                         I don't see that point as a blocker.
>>>>>> >
>>>>>> >                                               * represents JSON, which is not an API but an object
>>>>>> >                                                 notation:
>>>>>> >                                                   o it is defined as unicode string in a certain format.
>>>>>> >                                                     If you choose to adhere to ECMA-404, then it doesn't
>>>>>> >                                                     sound like JsonObject can represent an Avro object,
>>>>>> >                                                     if I'm reading it right;
>>>>>> >
>>>>>> >
>>>>>> >                                         It is in the generator impl, you can impl an avrogenerator.
>>>>>> >
>>>>>> >                                               * doesn't define a type system (JSON does, but it's
>>>>>> >                                                 lacking):
>>>>>> >                                                   o for example, JSON doesn't define semantics for
>>>>>> >                                                     numbers;
>>>>>> >                                                   o doesn't define date/time types;
>>>>>> >                                                   o doesn't allow extending JSON type system at all;
>>>>>> >
>>>>>> >
>>>>>> >                                         That is why you need a metadata object, or simpler, a schema
>>>>>> >                                         with that data. JSON or Beam record doesn't help here and you
>>>>>> >                                         end up with the same outcome if you think about it.
>>>>>> >
>>>>>> >                                               * lacks schemas;
>>>>>> >
>>>>>> >                                         JSON Schema is standard, widespread and well tooled compared
>>>>>> >                                         to the alternatives.
>>>>>> >
>>>>>> >                                             You can definitely try to loosen the requirements and
>>>>>> >                                             define everything in JSON in userland, but the point of
>>>>>> >                                             Row/Schema is to avoid it and define everything in the
>>>>>> >                                             Beam model, which can be extended, mapped to JSON, Avro,
>>>>>> >                                             BigQuery Schemas, custom binary format etc., with the same
>>>>>> >                                             semantics across Beam SDKs.
>>>>>> >
>>>>>> >
>>>>>> >                                         This is what jsonp would allow, with the benefit of natural
>>>>>> >                                         pojo support through jsonb.
>>>>>> >
>>>>>> >
>>>>>> >
>>>>>> >                                             On Thu, Apr 26, 2018 at 12:28 PM Romain Manni-Bucau
>>>>>> >                                             <rmannibucau@gmail.com <mailto:rmannibucau@gmail.com>>
>>>>>> >                                             wrote:
>>>>>> >
>>>>>> >                                                 Just to be clear and to help me understand: how is
>>>>>> >                                                 BeamRecord different from a JsonObject, which is an
>>>>>> >                                                 API without implementation (not even a JSON one
>>>>>> >                                                 OOTB)? The advantages of the JSON *API* are indeed
>>>>>> >                                                 natural mapping (jsonb is based on jsonp so no new
>>>>>> >                                                 binding to reinvent) and simple serialization
>>>>>> >                                                 (json+gzip for ex, or avro if you want to be geeky).
>>>>>> >
>>>>>> >                                                 I fail to see the point of rebuilding an ecosystem
>>>>>> >                                                 ATM.
>>>>>> >
>>>>>> >                                                 Le 26 avr. 2018 19:12, "Reuven Lax" <relax@google.com
>>>>>> >                                                 <mailto:relax@google.com>> a écrit :
>>>>>> >
>>>>>> >                                                     Exactly what JB said. We will write a generic
>>>>>> >                                                     conversion from Avro (or json) to Beam schemas,
>>>>>> >                                                     which will make them work transparently with SQL.
>>>>>> >                                                     The plan is also to migrate Anton's work so that
>>>>>> >                                                     POJOs work generically for any schema.
>>>>>> >
>>>>>> >                                                     Reuven
>>>>>> >
>>>>>> >                                                     On Thu, Apr 26, 2018 at 1:17 AM Jean-Baptiste Onofré
>>>>>> >                                                     <jb@nanthrax.net <mailto:jb@nanthrax.net>> wrote:
>>>>>> >
>>>>>> >                                                         For now we
>>>>>> have
>>>>>> >                                                         a generic
>>>>>> schema
>>>>>> >                                                         interface.
>>>>>> >                                                         Json-b can
>>>>>> be an
>>>>>> >                                                         impl, avro
>>>>>> could
>>>>>> >                                                         be another
>>>>>> one.
>>>>>> >
>>>>>> >                                                         Regards
>>>>>> >                                                         JB
>>>>>> >                                                         On Apr 26, 2018 at 12:08, Romain Manni-Bucau <rmannibucau@gmail.com <mailto:rmannibucau@gmail.com>> wrote:
>>>>>> >
>>>>>> >                                                             Hmm,
>>>>>> >
>>>>>> >                                                             Avro still has the pitfall of an uncontrolled stack that brings in
>>>>>> >                                                             far too many dependencies to be part of any API. This is why I
>>>>>> >                                                             proposed a JSON-P based API (JsonObject) with a custom Beam entry
>>>>>> >                                                             for some metadata (headers "à la Camel").
>>>>>> >
>>>>>> >
>>>>>> >                                                             Romain Manni-Bucau
>>>>>> >                                                             @rmannibucau <https://twitter.com/rmannibucau> | Blog
>>>>>> >                                                             <https://rmannibucau.metawerx.net/> | Old Blog
>>>>>> >                                                             <http://rmannibucau.wordpress.com> | Github
>>>>>> >                                                             <https://github.com/rmannibucau> | LinkedIn
>>>>>> >                                                             <https://www.linkedin.com/in/rmannibucau> | Book
>>>>>> >                                                             <https://www.packtpub.com/application-development/java-ee-8-high-performance>
>>>>>> >
>>>>>> >
>>>>>> >
>>>>>> >                                                             2018-04-26 9:59 GMT+02:00 Jean-Baptiste Onofré <jb@nanthrax.net <mailto:jb@nanthrax.net>>:
>>>>>> >
>>>>>> >                                                                 Hi Ismael
>>>>>> >
>>>>>> >                                                                 You mean directly in Beam SQL?
>>>>>> >
>>>>>> >                                                                 That will be part of schema support: generic record could be
>>>>>> >                                                                 one of the payload with across schema.
>>>>>> >
>>>>>> >                                                                 Regards
>>>>>> >                                                                 JB
>>>>>> >                                                                 On Apr 26, 2018 at 11:39, "Ismaël Mejía" <iemejia@gmail.com <ma...@gmail.com>> wrote:
>>>>>> >
>>>>>>  Hello Anton,
>>>>>> >
>>>>>>  Thanks for the descriptive email and the really useful work. Any plans
>>>>>>  to tackle PCollections of GenericRecord/IndexedRecords? it seems Avro
>>>>>>  is a natural fit for this approach too.
>>>>>> >
>>>>>>  Regards,
>>>>>>  Ismaël
>>>>>> >
>>>>>> >
>>>>>>  On Wed, Apr 25, 2018 at 9:04 PM, Anton Kedin <kedin@google.com <ma...@google.com>> wrote:
>>>>>> >
>>>>>>      Hi,
>>>>>> >
>>>>>>      I want to highlight a couple of improvements to Beam SQL we have been
>>>>>>      working on recently which are targeted to make Beam SQL API easier to use.
>>>>>>      Specifically these features simplify conversion of Java Beans and JSON
>>>>>>      strings to Rows.
>>>>>> >
>>>>>>      Feel free to try this and send any bugs/comments/PRs my way.
>>>>>> >
>>>>>>      **Caveat: this is still work in progress, and has known bugs and incomplete
>>>>>>      features, see below for details.**
>>>>>> >
>>>>>>      Background
>>>>>> >
>>>>>>      Beam SQL queries can only be applied to PCollection<Row>. This means that
>>>>>>      users need to convert whatever PCollection elements they have to Rows before
>>>>>>      querying them with SQL. This usually requires
>>>>>> >
>>>>>>
>>>>>
>>>>>

Re: Beam SQL Improvements

Posted by Lukasz Cwik <lc...@google.com>.
Shouldn't the runner isolate each instance of the pipeline behind an
appropriate class loader?
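
For illustration, a minimal sketch of the interference described in the quoted message below. Static fields in Java exist once per class per class loader, so two pipelines loaded through the same loader share them, while pipelines isolated behind separate loaders would each get their own copy. All names here (StaticStateSketch, SHARED, stateFor) are hypothetical and nothing in it is Beam API; the explicit pipeline id is just one assumed way to scope worker-lifetime state without relying on class loader isolation.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustration only: every name here is hypothetical, none of it is Beam API.
final class StaticStateSketch {
    // A plain static field: one copy per class loader, shared by every
    // pipeline that runs in the same JVM through that loader.
    static final Map<String, String> SHARED = new ConcurrentHashMap<>();

    // The same worker-lifetime state, but scoped by an explicit pipeline id.
    private static final Map<String, Map<String, String>> SCOPED = new ConcurrentHashMap<>();

    static Map<String, String> stateFor(String pipelineId) {
        return SCOPED.computeIfAbsent(pipelineId, id -> new ConcurrentHashMap<>());
    }

    public static void main(String[] args) {
        // Two pipelines write the same key into the shared static map:
        // the second write replaces the first.
        SHARED.put("converter", "from pipeline A");
        SHARED.put("converter", "from pipeline B");
        System.out.println("shared: " + SHARED.get("converter"));

        // With scoped state each pipeline keeps its own value.
        stateFor("A").put("converter", "from pipeline A");
        stateFor("B").put("converter", "from pipeline B");
        System.out.println("scoped A: " + stateFor("A").get("converter"));
        System.out.println("scoped B: " + stateFor("B").get("converter"));
    }
}
```

The scoped variant still lives in a static map, so it only helps if every user of the state passes the right id; class loader isolation would avoid that burden but has to be provided by the runner.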

On Sun, Jun 3, 2018 at 12:45 PM Reuven Lax <re...@google.com> wrote:

> Just an update: Romain and I chatted on Slack, and I think I understand
> his concern. The concern wasn't specifically about schemas, but rather about
> having a generic way to register per-ParDo state that has worker lifetime.
> As evidence that this is needed, static variables are used in many cases to
> simulate it. Static variables have downsides, however: if two pipelines
> are run on the same JVM (this happens often with unit tests, and nothing
> prevents a runner from doing so in a production environment), these
> static variables will interfere with each other.
>
> On Thu, May 24, 2018 at 12:30 AM Reuven Lax <re...@google.com> wrote:
>
>> Romain, maybe it would be useful for us to find some time on Slack. I'd
>> like to understand your concerns. Also keep in mind that I'm tagging all
>> these classes as Experimental for now, so we can definitely change these
>> interfaces around if we decide they are not the best ones.
>>
>> Reuven
>>
>> On Tue, May 22, 2018 at 11:35 PM Romain Manni-Bucau <
>> rmannibucau@gmail.com> wrote:
>>
>>> Why not extend ProcessContext to add the new remapped output? It looks
>>> good otherwise. The part I don't like is that creating a new context each
>>> time a new feature is added hurts users. What happens when Beam adds
>>> reactive support? A ReactiveOutputReceiver?
>>>
>>> Pipeline sounds like the wrong storage: once distributed, the instances
>>> have been serialized, which breaks the lifecycle of the original instance,
>>> and there is no real release/close hook on them anymore, right? I'm not
>>> sure we can do better than DoFn/source embedded instances today.
>>>
>>>
>>>
>>>
>>> On Wed, May 23, 2018 at 08:02, Romain Manni-Bucau <rm...@gmail.com> wrote:
>>>
>>>>
>>>>
>>>> On Wed, May 23, 2018 at 07:55, Jean-Baptiste Onofré <jb...@nanthrax.net> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> IMHO, it would be better to have an explicit transform/IO as the converter.
>>>>>
>>>>> It would be easier for users.
>>>>>
>>>>> Another option would be to use a "TypeConverter/SchemaConverter" map, as
>>>>> we do in Camel: Beam could check the source/destination "type" and look
>>>>> in the map for an available converter. This map can be stored as
>>>>> part of the pipeline (as we do for filesystem registration).
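
A minimal sketch of the converter-map idea above, assuming converters are keyed by the (source type, destination type) pair. The class and method names (ConverterMapSketch, register, has, convert) are hypothetical and are neither Beam nor Camel API. The has() check illustrates how the presence of a converter could be verified before a pipeline runs rather than discovered late.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.function.Function;

// Illustration only: hypothetical names, not Beam or Camel API.
final class ConverterMapSketch {
    // Converters keyed by the (source type, destination type) pair.
    private final Map<List<Class<?>>, Function<Object, Object>> converters = new HashMap<>();

    <S, D> void register(Class<S> from, Class<D> to, Function<S, D> fn) {
        converters.put(List.of(from, to), value -> fn.apply(from.cast(value)));
    }

    // True if a converter for this exact pair was registered; this could be
    // checked for every edge of a pipeline before it runs.
    boolean has(Class<?> from, Class<?> to) {
        return converters.containsKey(List.of(from, to));
    }

    // Looks up a converter by the runtime type of the value and the requested
    // destination type; empty if none was registered.
    <D> Optional<D> convert(Object value, Class<D> to) {
        Function<Object, Object> fn = converters.get(List.of(value.getClass(), to));
        return fn == null ? Optional.empty() : Optional.of(to.cast(fn.apply(value)));
    }

    public static void main(String[] args) {
        ConverterMapSketch map = new ConverterMapSketch();
        map.register(String.class, Integer.class, Integer::parseInt);
        System.out.println(map.convert("42", Integer.class));
        System.out.println(map.has(Integer.class, String.class));
    }
}
```

The lookup is by exact class, so subclasses of a registered source type would not match; a real registry would need a policy for that.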
>>>>>
>>>>
>>>>
>>>> It works in Camel because it is not strongly typed, doesn't it? So it could
>>>> require a new Beam pipeline API.
>>>>
>>>> +1 for the explicit transform. If it is added to the pipeline API the way
>>>> coders are, it wouldn't break the fluent API:
>>>>
>>>> p.apply(io).setOutputType(Foo.class)
>>>>
>>>> Coders can be a workaround since they own the type, but since the
>>>> PCollection is the real owner it is surely saner this way, no?
>>>>
>>>> It probably also needs to ensure all converters are present before running
>>>> the pipeline. Having no implicit environment converter support is probably
>>>> a good way to start, to avoid late surprises.
>>>>
>>>>
>>>>
>>>>> My $0.01
>>>>>
>>>>> Regards
>>>>> JB
>>>>>
>>>>> On 23/05/2018 07:51, Romain Manni-Bucau wrote:
>>>>> > How does it work on the pipeline side?
>>>>> > Do you generate these "virtual" IOs at build time so that the fluent
>>>>> > API works without erasing generics?
>>>>> >
>>>>> > ex: SQL(row)->BigQuery(native) will not compile so we need a
>>>>> > SQL(row)->BigQuery(row)
>>>>> >
>>>>> > Side note unrelated to Row: if you add another registry, maybe a
>>>>> > prerequisite task is to ensure Beam has a kind of singleton/context, to
>>>>> > avoid duplicating it or failing to track it properly. These kinds of
>>>>> > converters will generally need a global close and not only a per-record one:
>>>>> > converter.init();converter.convert(row);....converter.destroy();,
>>>>> > otherwise it easily leaks. This is why it can require some way to not
>>>>> > recreate it. A quick fix, if you are in ByteBuddy already, is probably to
>>>>> > add it to setup/teardown; being more global would be nicer but is more
>>>>> > challenging.
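
A minimal sketch of the init/convert/destroy lifecycle described above, tied to setup and teardown hooks so the converter is created once and always released. The names (Converter, ConvertingFn, runAll) are hypothetical stand-ins; the setup()/teardown() methods only mirror the idea of per-instance lifecycle hooks and are not Beam's DoFn API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Illustration only: hypothetical names, not Beam API.
final class ConverterLifecycleSketch {
    // A converter that holds resources and therefore needs a global close.
    interface Converter<I, O> {
        void init();
        O convert(I input);
        void destroy();
    }

    // Toy converter that records its lifecycle events.
    static final class UpperCaseConverter implements Converter<String, String> {
        final List<String> events = new ArrayList<>();

        @Override public void init() { events.add("init"); }

        @Override public String convert(String input) {
            events.add("convert");
            return input.toUpperCase(Locale.ROOT);
        }

        @Override public void destroy() { events.add("destroy"); }
    }

    // Function object whose setup/teardown own the converter lifecycle.
    static final class ConvertingFn {
        private final Converter<String, String> converter;

        ConvertingFn(Converter<String, String> converter) { this.converter = converter; }

        void setup() { converter.init(); }
        String process(String element) { return converter.convert(element); }
        void teardown() { converter.destroy(); }
    }

    // Runs setup once, processes every element, and always runs teardown.
    static List<String> runAll(ConvertingFn fn, List<String> elements) {
        List<String> out = new ArrayList<>();
        fn.setup();
        try {
            for (String element : elements) {
                out.add(fn.process(element));
            }
        } finally {
            fn.teardown();
        }
        return out;
    }

    public static void main(String[] args) {
        UpperCaseConverter converter = new UpperCaseConverter();
        System.out.println(runAll(new ConvertingFn(converter), List.of("a", "b")));
        System.out.println(converter.events);
    }
}
```

This scopes the converter to one function instance; sharing a single converter across instances is the "more global" variant mentioned above and would need some form of reference counting or an owning context.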
>>>>> >
>>>>> > Romain Manni-Bucau
>>>>> > @rmannibucau <https://twitter.com/rmannibucau> |  Blog
>>>>> > <https://rmannibucau.metawerx.net/> | Old Blog
>>>>> > <http://rmannibucau.wordpress.com> | Github
>>>>> > <https://github.com/rmannibucau> | LinkedIn
>>>>> > <https://www.linkedin.com/in/rmannibucau> | Book
>>>>> > <https://www.packtpub.com/application-development/java-ee-8-high-performance>
>>>>> >
>>>>> >
>>>>> > On Wed, May 23, 2018 at 07:22, Reuven Lax <relax@google.com
>>>>> > <ma...@google.com>> wrote:
>>>>> >
>>>>> >     No - the only modules we need to add to core are the ones we choose
>>>>> >     to add. For example, I will probably add a registration for
>>>>> >     TableRow/TableSchema (GCP BigQuery) so these can work seamlessly
>>>>> >     with schemas. However, I will add that to the GCP module, so only
>>>>> >     someone depending on that module needs to pull in that dependency.
>>>>> >     The Java ServiceLoader framework can be used by these modules to
>>>>> >     register schemas for their types (we already do something similar
>>>>> >     for FileSystem and for coders as well).
>>>>> >
>>>>> >     BTW, right now I'm doing the conversion back and forth between Row
>>>>> >     objects in the ByteBuddy-generated bytecode that we generate in order
>>>>> >     to invoke DoFns.
>>>>> >
>>>>> >     Reuven
>>>>> >
>>>>> >     On Tue, May 22, 2018 at 10:04 PM Romain Manni-Bucau
>>>>> >     <rmannibucau@gmail.com <ma...@gmail.com>> wrote:
>>>>> >
>>>>> >         Hmm, the pluggability part is close to what I wanted to do with
>>>>> >         JsonObject as a main API (to avoid redoing a "row" API and a
>>>>> >         schema API).
>>>>> >         Row.as(Class<T>) sounds good, but does it mean we'll get
>>>>> >         beam-sdk-java-row-jsonobject-like modules (I'm not against it, just
>>>>> >         trying to understand here)?
>>>>> >         If so, how can an IO use as() with the type it expects? Doesn't
>>>>> >         it lead to having tons of these modules in the end?
>>>>> >
>>>>> >         Romain Manni-Bucau
>>>>> >         @rmannibucau <https://twitter.com/rmannibucau> |  Blog
>>>>> >         <https://rmannibucau.metawerx.net/> | Old Blog
>>>>> >         <http://rmannibucau.wordpress.com> | Github
>>>>> >         <https://github.com/rmannibucau> | LinkedIn
>>>>> >         <https://www.linkedin.com/in/rmannibucau> | Book
>>>>> >         <https://www.packtpub.com/application-development/java-ee-8-high-performance>
>>>>> >
>>>>> >
>>>>> >         On Wed, May 23, 2018 at 04:57, Reuven Lax <relax@google.com
>>>>> >         <ma...@google.com>> wrote:
>>>>> >
>>>>> >             By the way Romain, if you have specific scenarios in mind I
>>>>> >             would love to hear them. I can try and guess what exactly
>>>>> >             you would like to get out of schemas, but it would work
>>>>> >             better if you gave me concrete scenarios that you would like
>>>>> >             to work.
>>>>> >
>>>>> >             Reuven
>>>>> >
>>>>> >             On Tue, May 22, 2018 at 7:45 PM Reuven Lax <relax@google.com
>>>>> >             <ma...@google.com>> wrote:
>>>>> >
>>>>> >                 Yeah, what I'm working on will help with IO. Basically,
>>>>> >                 if you register a function with SchemaRegistry that
>>>>> >                 converts back and forth between a type (say JsonObject)
>>>>> >                 and a Beam Row, then it is applied by the framework
>>>>> >                 behind the scenes as part of DoFn invocation. Concrete
>>>>> >                 example: let's say I have an IO that reads JSON objects:
>>>>> >
>>>>> >                   class MyJsonIORead extends PTransform<PBegin,
>>>>> >                       PCollection<JsonObject>> {...}
>>>>> >
>>>>> >                 If you register a schema for this type (or you can also
>>>>> >                 just set the schema directly on the output PCollection),
>>>>> >                 then Beam knows how to convert back and forth between
>>>>> >                 JsonObject and Row. So the next ParDo can look like
>>>>> >
>>>>> >                 p.apply(new MyJsonIORead())
>>>>> >                  .apply(ParDo.of(new DoFn<JsonObject, T>() {
>>>>> >                     // The element is requested as a Row, not a JsonObject.
>>>>> >                     @ProcessElement public void process(@Element Row row) {
>>>>> >                     }
>>>>> >                   }));
>>>>> >
>>>>> >                 And Beam will automatically convert JsonObject to a Row
>>>>> >                 for processing (you aren't forced to do this, of course;
>>>>> >                 you can always ask for it as a JsonObject).
>>>>> >
>>>>> >                 The same is true for output. If you have a sink that
>>>>> >                 takes in JsonObject but the transform before it produces
>>>>> >                 Row objects (for instance, because the transform before
>>>>> >                 it is Beam SQL), Beam can automatically convert Row back
>>>>> >                 to JsonObject for you.
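
A minimal sketch of the registration idea above: a pair of functions per type that converts to and from a row, which a framework could then apply on the user's behalf on both the input and the output side. This is a toy stand-in and not Beam's SchemaRegistry or Row. A row is modelled as a field-name to value map, and SchemaRegistrySketch, User, register, toRow and fromRow are all hypothetical names.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Illustration only: a toy stand-in, not Beam's SchemaRegistry or Row.
final class SchemaRegistrySketch {
    // A pair of functions converting a user type to and from a "row",
    // modelled here as a field-name -> value map.
    private static final class Conversion<T> {
        final Function<T, Map<String, Object>> toRow;
        final Function<Map<String, Object>, T> fromRow;

        Conversion(Function<T, Map<String, Object>> toRow,
                   Function<Map<String, Object>, T> fromRow) {
            this.toRow = toRow;
            this.fromRow = fromRow;
        }
    }

    private final Map<Class<?>, Conversion<?>> conversions = new HashMap<>();

    <T> void register(Class<T> type,
                      Function<T, Map<String, Object>> toRow,
                      Function<Map<String, Object>, T> fromRow) {
        conversions.put(type, new Conversion<>(toRow, fromRow));
    }

    // Safe because register() only stores a Conversion<T> under Class<T>.
    @SuppressWarnings("unchecked")
    private <T> Conversion<T> lookup(Class<T> type) {
        Conversion<T> conversion = (Conversion<T>) conversions.get(type);
        if (conversion == null) {
            throw new IllegalStateException("no schema registered for " + type.getName());
        }
        return conversion;
    }

    <T> Map<String, Object> toRow(Class<T> type, T value) {
        return lookup(type).toRow.apply(value);
    }

    <T> T fromRow(Class<T> type, Map<String, Object> row) {
        return lookup(type).fromRow.apply(row);
    }

    // Hypothetical user type standing in for JsonObject or a POJO.
    static final class User {
        final String name;
        final int age;

        User(String name, int age) {
            this.name = name;
            this.age = age;
        }
    }

    static SchemaRegistrySketch withUserSchema() {
        SchemaRegistrySketch registry = new SchemaRegistrySketch();
        registry.register(User.class,
            user -> {
                Map<String, Object> row = new LinkedHashMap<>();
                row.put("name", user.name);
                row.put("age", user.age);
                return row;
            },
            row -> new User((String) row.get("name"), (Integer) row.get("age")));
        return registry;
    }

    public static void main(String[] args) {
        SchemaRegistrySketch registry = withUserSchema();
        Map<String, Object> row = registry.toRow(User.class, new User("ada", 36));
        System.out.println(row);
        User back = registry.fromRow(User.class, row);
        System.out.println(back.name + " " + back.age);
    }
}
```

In this sketch the caller invokes toRow/fromRow explicitly; the message above describes the framework doing that step behind the scenes during DoFn invocation.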
>>>>> >
>>>>> >                 All of this was detailed in the Schema doc I shared a
>>>>> >                 few months ago. There was a lot of discussion on that
>>>>> >                 document from various parties, and some of this API is a
>>>>> >                 result of that discussion. This is also working in the
>>>>> >                 branch JB and I were working on, though not yet
>>>>> >                 integrated back to master.
>>>>> >
>>>>> >                 I would like to actually go further and make Row an
>>>>> >                 interface and provide a way to automatically put a Row
>>>>> >                 interface on top of any other object (e.g. JsonObject,
>>>>> >                 Pojo, etc.). This won't change the way the user writes
>>>>> >                 code, but instead of Beam having to copy and convert at
>>>>> >                 each stage (e.g. from JsonObject to Row) it will simply
>>>>> >                 create a Row object that uses the JsonObject as its
>>>>> >                 underlying storage.
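
A minimal sketch of the difference between copy-and-convert and a row interface layered over existing storage, as described above. A Map stands in for JsonObject, and RowView, copyOf and viewOf are hypothetical names, not Beam API.

```java
import java.util.HashMap;
import java.util.Map;

// Illustration only: hypothetical names; a Map stands in for JsonObject.
final class RowViewSketch {
    // Minimal row-like interface: read a field by name.
    interface RowView {
        Object getValue(String field);
    }

    // Copy-and-convert: values are copied into new storage up front.
    static RowView copyOf(Map<String, Object> source) {
        Map<String, Object> copy = new HashMap<>(source);
        return copy::get;
    }

    // View: no copy, every read is delegated to the underlying object.
    static RowView viewOf(Map<String, Object> source) {
        return source::get;
    }

    public static void main(String[] args) {
        Map<String, Object> json = new HashMap<>();
        json.put("name", "ada");

        RowView copied = copyOf(json);
        RowView view = viewOf(json);

        // A later change to the underlying object is visible through the
        // view but not through the copy.
        json.put("name", "grace");
        System.out.println(copied.getValue("name"));
        System.out.println(view.getValue("name"));
    }
}
```

The view avoids the per-stage copy, at the cost of the row being only as immutable as the object underneath it.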
>>>>> >
>>>>> >                 Reuven
>>>>> >
>>>>> >                 On Tue, May 22, 2018 at 11:37 AM Romain Manni-Bucau
>>>>> >                 <rmannibucau@gmail.com <mailto:rmannibucau@gmail.com>>
>>>>> >                 wrote:
>>>>> >
>>>>> >                     Well, Beam can implement a new mapper, but that doesn't
>>>>> >                     help for IO. Most modern backends will take JSON
>>>>> >                     directly, even the javax one, and it must stay generic.
>>>>> >
>>>>> >                     Then, since JSON-to-POJO mapping has already been done a
>>>>> >                     dozen times, I'm not sure it is worth it for now.
>>>>> >                     On Tue, May 22, 2018 at 20:27, Reuven Lax
>>>>> >                     <relax@google.com <ma...@google.com>> wrote:
>>>>> >
>>>>> >                         We can do even better, by the way: build a
>>>>> >                         SchemaRegistry where automatic conversions can
>>>>> >                         be registered between schemas and Java data
>>>>> >                         types. With this the user won't even need a DoFn
>>>>> >                         to do the conversion.
>>>>> >
>>>>> >                         On Tue, May 22, 2018, 10:13 AM Romain
>>>>> >                         Manni-Bucau <rmannibucau@gmail.com
>>>>> >                         <ma...@gmail.com>> wrote:
>>>>> >
>>>>> >                             Hi guys,
>>>>> >
>>>>> >                             I checked out what has been done on the schema
>>>>> >                             model and think it is acceptable, regarding
>>>>> >                             the JSON debate, if
>>>>> >                             https://issues.apache.org/jira/browse/BEAM-4381
>>>>> >                             can be fixed.
>>>>> >
>>>>> >                             At a high level, it is about providing a
>>>>> >                             mainstream, low-impact model out of the box,
>>>>> >                             and JSON seems the most valid option for
>>>>> >                             now, at least for IO and some user transforms.
>>>>> >
>>>>> >                             What do you think?
>>>>> >
>>>>> >                             On Fri, Apr 27, 2018 at 18:36, Romain
>>>>> >                             Manni-Bucau <rmannibucau@gmail.com
>>>>> >                             <ma...@gmail.com>> wrote:
>>>>> >
>>>>> >                                 Can give it a try at the end of May, sure
>>>>> >                                 (holidays and work constraints will make
>>>>> >                                 it hard before then).
>>>>> >
>>>>> >                                 On Apr 27, 2018 at 18:26, "Anton Kedin"
>>>>> >                                 <kedin@google.com
>>>>> >                                 <ma...@google.com>> wrote:
>>>>> >
>>>>> >                                     Romain,
>>>>> >
>>>>> >                                     I don't believe that the JSON approach
>>>>> >                                     was investigated very thoroughly. I
>>>>> >                                     mentioned a few reasons which will
>>>>> >                                     make it not the best choice in my
>>>>> >                                     opinion, but I may be wrong. Can you
>>>>> >                                     put together a design doc or a
>>>>> >                                     prototype?
>>>>> >
>>>>> >                                     Thank you,
>>>>> >                                     Anton
>>>>> >
>>>>> >
>>>>> >                                     On Thu, Apr 26, 2018 at 10:17 PM
>>>>> >                                     Romain Manni-Bucau
>>>>> >                                     <rmannibucau@gmail.com
>>>>> >                                     <ma...@gmail.com>> wrote:
>>>>> >
>>>>> >
>>>>> >
>>>>> >                                         On Apr 26, 2018 at 23:13, "Anton
>>>>> >                                         Kedin" <kedin@google.com
>>>>> >                                         <ma...@google.com>> wrote:
>>>>> >
>>>>> >                                             BeamRecord (Row) has very little in common with
>>>>> >                                             JsonObject (I assume you're talking about
>>>>> >                                             javax.json), except maybe some similarities of
>>>>> >                                             the API. A few reasons why JsonObject doesn't work:
>>>>> >
>>>>> >                                               * it is a Java EE API:
>>>>> >                                                   o Beam SDK is not limited to Java. There are
>>>>> >                                                     probably similar APIs for other languages
>>>>> >                                                     but they might not necessarily carry the
>>>>> >                                                     same semantics / APIs;
>>>>> >
>>>>> >
>>>>> >                                         Not a big deal I think. At least
>>>>> >                                         not a technical blocker.
>>>>> >
>>>>> >                                                   o It can change between Java versions;
>>>>> >
>>>>> >                                         No, this is Java EE ;).
>>>>> >
>>>>> >
>>>>> >                                                   o Current Beam Java implementation is an
>>>>> >                                                     experimental feature to identify what's
>>>>> >                                                     needed from such an API; in the end we
>>>>> >                                                     might end up with something similar to
>>>>> >                                                     the JsonObject API, but likely not;
>>>>> >
>>>>> >                                         I don't see that point as a blocker.
>>>>> >
>>>>> >                                               * represents JSON, which is not an API but an
>>>>> >                                                 object notation:
>>>>> >                                                   o it is defined as a unicode string in a
>>>>> >                                                     certain format. If you choose to adhere
>>>>> >                                                     to ECMA-404, then it doesn't sound like
>>>>> >                                                     JsonObject can represent an Avro object,
>>>>> >                                                     if I'm reading it right;
>>>>> >
>>>>> >                                         It is in the generator implementation; you
>>>>> >                                         can implement an Avro generator.
>>>>> >
>>>>> >                                               * doesn't define a type system (JSON does,
>>>>> >                                                 but it's lacking):
>>>>> >                                                   o for example, JSON doesn't define
>>>>> >                                                     semantics for numbers;
>>>>> >                                                   o doesn't define date/time types;
>>>>> >                                                   o doesn't allow extending JSON type
>>>>> >                                                     system at all;
>>>>> >
>>>>> >
>>>>> >                                         That is why you need a metadata object or,
>>>>> >                                         more simply, a schema with that data. Neither
>>>>> >                                         JSON nor a Beam record helps here, and you end
>>>>> >                                         up with the same outcome if you think about it.
>>>>> >
>>>>> >                                               * lacks schemas;
>>>>> >
>>>>> >                                         JSON Schemas are standard, widespread and
>>>>> >                                         well tooled compared to the alternatives.
>>>>> >
>>>>> >                                             You can definitely try to loosen the
>>>>> >                                             requirements and define everything in JSON in
>>>>> >                                             userland, but the point of Row/Schema is to
>>>>> >                                             avoid that and define everything in the Beam
>>>>> >                                             model, which can be extended and mapped to JSON,
>>>>> >                                             Avro, BigQuery schemas, custom binary formats
>>>>> >                                             etc., with the same semantics across Beam SDKs.
>>>>> >
>>>>> >
>>>>> >                                         This is what JSON-P would allow, with the
>>>>> >                                         benefit of natural POJO support through JSON-B.
>>>>> >
>>>>> >
>>>>> >
>>>>> >                                             On Thu, Apr 26, 2018 at 12:28 PM Romain Manni-Bucau
>>>>> >                                             <rmannibucau@gmail.com <mailto:rmannibucau@gmail.com>>
>>>>> >                                             wrote:
>>>>> >
>>>>> >                                                 Just to be clear and to help me understand:
>>>>> >                                                 how is BeamRecord different from a JsonObject,
>>>>> >                                                 which is an API without implementation (not
>>>>> >                                                 even a JSON one OOTB)? The advantages of the
>>>>> >                                                 JSON *API* are indeed natural mapping (JSON-B
>>>>> >                                                 is based on JSON-P so there is no new binding
>>>>> >                                                 to reinvent) and simple serialization
>>>>> >                                                 (JSON+gzip for example, or Avro if you want to
>>>>> >                                                 be geeky).
>>>>> >
>>>>> >                                                 I fail to see the point of rebuilding an
>>>>> >                                                 ecosystem ATM.
>>>>> >
>>>>> >                                                 On Apr 26, 2018 at 19:12, "Reuven Lax"
>>>>> >                                                 <relax@google.com> wrote:
>>>>> >
>>>>> >                                                     Exactly what JB said. We will write a generic conversion
>>>>> >                                                     from Avro (or json) to Beam schemas, which will make them
>>>>> >                                                     work transparently with SQL. The plan is also to migrate
>>>>> >                                                     Anton's work so that POJOs work generically for any
>>>>> >                                                     schema.
>>>>> >
>>>>> >                                                     Reuven
>>>>> >
>>>>> >                                                     On Thu, Apr 26, 2018 at 1:17 AM Jean-Baptiste Onofré
>>>>> >                                                     <jb@nanthrax.net> wrote:
>>>>> >
>>>>> >                                                         For now we have a generic schema interface.
>>>>> >                                                         Json-b can be an impl, avro could be another one.
>>>>> >
>>>>> >                                                         Regards
>>>>> >                                                         JB
>>>>> >                                                         On Apr 26, 2018, at 12:08, Romain Manni-Bucau
>>>>> >                                                         <rmannibucau@gmail.com> wrote:
>>>>> >
>>>>> >                                                             Hmm,
>>>>> >
>>>>> >                                                             avro still has the pitfall of an uncontrolled stack
>>>>> >                                                             which brings way too many dependencies to be part of
>>>>> >                                                             any API; this is why I proposed a JSON-P based API
>>>>> >                                                             (JsonObject) with a custom beam entry for some
>>>>> >                                                             metadata (headers "à la Camel").
>>>>> >
>>>>> >
>>>>> >                                                             Romain Manni-Bucau
>>>>> >                                                             @rmannibucau <https://twitter.com/rmannibucau> |
>>>>> >                                                             Blog <https://rmannibucau.metawerx.net/> |
>>>>> >                                                             Old Blog <http://rmannibucau.wordpress.com> |
>>>>> >                                                             Github <https://github.com/rmannibucau> |
>>>>> >                                                             LinkedIn <https://www.linkedin.com/in/rmannibucau> |
>>>>> >                                                             Book <https://www.packtpub.com/application-development/java-ee-8-high-performance>
>>>>> >
>>>>> >
>>>>> >
>>>>> >
>>>>> >                                                             2018-04-26 9:59 GMT+02:00 Jean-Baptiste Onofré
>>>>> >                                                             <jb@nanthrax.net>:
>>>>> >
>>>>> >
>>>>> >                                                                 Hi Ismael
>>>>> >
>>>>> >                                                                 You mean directly in Beam SQL?
>>>>> >
>>>>> >                                                                 That will be part of schema support: generic record
>>>>> >                                                                 could be one of the payload with across schema.
>>>>> >
>>>>> >                                                                 Regards
>>>>> >                                                                 JB
>>>>> >                                                                 On Apr 26, 2018, at 11:39, "Ismaël Mejía"
>>>>> >                                                                 <iemejia@gmail.com> wrote:
>>>>> >
>>>>> >
>>>>> >  Hello Anton,
>>>>> >
>>>>> >  Thanks for the descriptive email and the really useful work. Any plans
>>>>> >  to tackle PCollections of GenericRecord/IndexedRecords? It seems Avro
>>>>> >  is a natural fit for this approach too.
>>>>> >
>>>>> >  Regards,
>>>>> >  Ismaël
>>>>> >
>>>>> >
>>>>> >  On Wed, Apr 25, 2018 at 9:04 PM, Anton Kedin <kedin@google.com> wrote:
>>>>> >
>>>>> >
>>>>>
>>>>> >
>>>>> >
>>>>> >    Hi,
>>>>> >
>>>>> >    I want to highlight a couple of improvements to Beam SQL we have been
>>>>> >    working on recently which are targeted to make Beam SQL API easier to use.
>>>>> >    Specifically these features simplify conversion of Java Beans and JSON
>>>>> >    strings to Rows.
>>>>> >
>>>>> >    Feel free to try this and send any bugs/comments/PRs my way.
>>>>> >
>>>>> >    **Caveat: this is still work in progress, and has known bugs and incomplete
>>>>> >    features, see below for details.**
>>>>> >
>>>>> >    Background
>>>>> >
>>>>> >    Beam SQL queries can only be applied to PCollection<Row>. This means that
>>>>> >    users need to convert whatever PCollection elements they have to Rows before
>>>>> >    querying them with SQL. This usually requires
>>>>> >
>>>>>
>>>>
>>>>