Posted to dev@beam.apache.org by Reuven Lax <re...@google.com> on 2018/06/03 19:44:46 UTC
Re: Beam SQL Improvements
Just an update: Romain and I chatted on Slack, and I think I understand his
concern. The concern wasn't specifically about schemas, rather about having
a generic way to register per-ParDo state that has worker lifetime. As
evidence that such a thing is needed, in many cases static variables are
used to simulate it. Static variables, however, have downsides: if two
pipelines are run on the same JVM (which happens often with unit tests, and
there's nothing that prevents a runner from doing so in a production
environment), these static variables will interfere with each other.
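The static-variable pitfall described above can be sketched in plain Java (a hypothetical `ExpensiveClientFn`, no Beam dependency): two pipeline instances in one JVM share the same static field, so state from one leaks into the other.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical DoFn-like class that keeps "worker lifetime" state in a
// static field, as described above.
class ExpensiveClientFn {
    // One field per JVM/classloader, shared by every pipeline loaded here.
    static final AtomicInteger OPEN_CLIENTS = new AtomicInteger();

    void setup()    { OPEN_CLIENTS.incrementAndGet(); }  // e.g. open a client
    void teardown() { OPEN_CLIENTS.decrementAndGet(); }  // e.g. close it
}

class StaticStateDemo {
    public static void main(String[] args) {
        ExpensiveClientFn pipelineA = new ExpensiveClientFn();
        ExpensiveClientFn pipelineB = new ExpensiveClientFn();
        pipelineA.setup();
        pipelineB.setup(); // a second pipeline in the same JVM
        // Both instances observe each other's state: the counter is 2, not 1.
        System.out.println(ExpensiveClientFn.OPEN_CLIENTS.get()); // 2
        pipelineB.teardown(); // tearing down B changes what A observes too
        System.out.println(ExpensiveClientFn.OPEN_CLIENTS.get()); // 1
    }
}
```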
On Thu, May 24, 2018 at 12:30 AM Reuven Lax <re...@google.com> wrote:
> Romain, maybe it would be useful for us to find some time on slack. I'd
> like to understand your concerns. Also keep in mind that I'm tagging all
> these classes as Experimental for now, so we can definitely change these
> interfaces around if we decide they are not the best ones.
>
> Reuven
>
> On Tue, May 22, 2018 at 11:35 PM Romain Manni-Bucau <rm...@gmail.com>
> wrote:
>
>> Why not extend ProcessContext to add the new remapped output? But it
>> looks good (the part I don't like is that creating a new context each time
>> a new feature is added hurts users. What about when beam adds some
>> reactive support? ReactiveOutputReceiver?)
>>
>> Pipeline sounds like the wrong storage since once distributed you
>> serialized the instances, so you kind of broke the lifecycle of the
>> original instance and have no real release/close hook on them anymore,
>> right? Not sure we can do better than dofn/source embedded instances today.
>>
>>
>>
>>
>> On Wed, May 23, 2018 at 08:02, Romain Manni-Bucau <rm...@gmail.com>
>> wrote:
>>
>>>
>>>
>>> On Wed, May 23, 2018 at 07:55, Jean-Baptiste Onofré <jb...@nanthrax.net>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> IMHO, it would be better to have an explicit transform/IO as converter.
>>>>
>>>> It would be easier for users.
>>>>
>>>> Another option would be to use a "TypeConverter/SchemaConverter" map as
>>>> we do in Camel: Beam could check the source/destination "type" and check
>>>> in the map if there's a converter available. This map can be stored as
>>>> part of the pipeline (as we do for filesystem registration).
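A minimal sketch of that converter-map idea (hypothetical names, not a real Beam or Camel API): a registry keyed by source/destination type that the pipeline could consult when adjacent transforms disagree on their element type.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical "TypeConverter/SchemaConverter" map, à la Camel: converters
// are registered per (source type, destination type) pair and looked up
// when a conversion is needed.
class ConverterRegistryDemo {
    static final Map<String, Function<Object, Object>> CONVERTERS = new HashMap<>();

    static <S, D> void register(Class<S> src, Class<D> dst, Function<S, D> fn) {
        // Wrap so the stored function accepts Object and casts safely.
        CONVERTERS.put(src.getName() + "->" + dst.getName(),
                       o -> fn.apply(src.cast(o)));
    }

    static <D> D convert(Object value, Class<D> dst) {
        Function<Object, Object> fn =
            CONVERTERS.get(value.getClass().getName() + "->" + dst.getName());
        if (fn == null) {
            throw new IllegalStateException(
                "No converter for " + value.getClass() + " -> " + dst);
        }
        return dst.cast(fn.apply(value));
    }

    public static void main(String[] args) {
        register(Integer.class, String.class, i -> "row-" + i);
        System.out.println(convert(42, String.class)); // row-42
    }
}
```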
>>>>
>>>
>>>
>>> It works in Camel because it is not strongly typed, doesn't it? So it
>>> can require a new beam pipeline api.
>>>
>>> +1 for the explicit transform; if added to the pipeline api as a coder
>>> it wouldn't break the fluent api:
>>>
>>> p.apply(io).setOutputType(Foo.class)
>>>
>>> Coders can be a workaround since they own the type, but since the
>>> pcollection is the real owner it is surely saner this way, no?
>>>
>>> Also it needs to ensure all converters are present before running the
>>> pipeline probably; no implicit environment converter support is probably
>>> good to start with, to avoid late surprises.
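The "ensure all converters are present before running the pipeline" idea could look like this sketch (hypothetical names): required conversions are recorded while the pipeline is built, and a single validation pass fails fast before anything runs.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical pre-run validation: every converter edge the pipeline needs
// is checked up front, instead of failing on the first element at runtime.
class ConverterValidationDemo {
    static final Set<String> REGISTERED = new HashSet<>();
    static final List<String> REQUIRED = new ArrayList<>();

    static void register(String edge) { REGISTERED.add(edge); }
    // Recorded while the pipeline graph is being built.
    static void require(String edge)  { REQUIRED.add(edge); }

    static void validateBeforeRun() {
        for (String edge : REQUIRED) {
            if (!REGISTERED.contains(edge)) {
                throw new IllegalStateException("Missing converter: " + edge);
            }
        }
    }

    public static void main(String[] args) {
        register("Row->Foo");
        require("Row->Foo");
        validateBeforeRun(); // passes: every required edge is registered
        require("Foo->Row"); // never registered
        try {
            validateBeforeRun();
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage()); // fails fast, before run time
        }
    }
}
```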
>>>
>>>
>>>
>>>> My $0.01
>>>>
>>>> Regards
>>>> JB
>>>>
>>>> On 23/05/2018 07:51, Romain Manni-Bucau wrote:
>>>> > How does it work on the pipeline side?
>>>> > Do you generate these "virtual" IO at build time to enable the fluent
>>>> > API to work not erasing generics?
>>>> >
>>>> > ex: SQL(row)->BigQuery(native) will not compile so we need a
>>>> > SQL(row)->BigQuery(row)
>>>> >
>>>> > Side note unrelated to Row: if you add another registry, maybe a
>>>> > pretask is to ensure beam has a kind of singleton/context to avoid
>>>> > duplicating it or not tracking it properly. These kinds of converters
>>>> > will need a global close and not only per record in general:
>>>> > converter.init(); converter.convert(row); ...; converter.destroy();,
>>>> > otherwise it easily leaks. This is why it can require some way to not
>>>> > recreate it. A quick fix, if you are in bytebuddy already, can be to
>>>> > add it to setup/teardown probably; being more global would be nicer
>>>> > but is more challenging.
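The init/convert/destroy lifecycle argued for above, sketched with a hypothetical converter: init once per worker (where a DoFn's @Setup would run), convert per element, destroy once (where @Teardown would run), so nothing is recreated or leaked per record.

```java
// Hypothetical converter with an explicit lifecycle: global init and
// destroy, per-record convert, as described above.
class RowConverter {
    private boolean initialized;

    void init() { initialized = true; }       // open pools, caches, clients...

    String convert(int record) {              // cheap per-record work only
        if (!initialized) {
            throw new IllegalStateException("init() was not called");
        }
        return "row-" + record;
    }

    void destroy() { initialized = false; }   // release everything once
}

class ConverterLifecycleDemo {
    public static void main(String[] args) {
        RowConverter converter = new RowConverter();
        converter.init();                     // would live in @Setup
        for (int record = 0; record < 3; record++) {
            System.out.println(converter.convert(record)); // per element
        }
        converter.destroy();                  // would live in @Teardown
    }
}
```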
>>>> >
>>>> > Romain Manni-Bucau
>>>> > @rmannibucau <https://twitter.com/rmannibucau> | Blog
>>>> > <https://rmannibucau.metawerx.net/> | Old Blog
>>>> > <http://rmannibucau.wordpress.com> | Github
>>>> > <https://github.com/rmannibucau> | LinkedIn
>>>> > <https://www.linkedin.com/in/rmannibucau> | Book
>>>> > <https://www.packtpub.com/application-development/java-ee-8-high-performance>
>>>> >
>>>> >
>>>> > On Wed, May 23, 2018 at 07:22, Reuven Lax <relax@google.com> wrote:
>>>> >
>>>> >     No - the only modules we need to add to core are the ones we
>>>> >     choose to add. For example, I will probably add a registration for
>>>> >     TableRow/TableSchema (GCP BigQuery) so these can work seamlessly
>>>> >     with schemas. However I will add that to the GCP module, so only
>>>> >     someone depending on that module needs to pull in that dependency.
>>>> >     The Java ServiceLoader framework can be used by these modules to
>>>> >     register schemas for their types (we already do something similar
>>>> >     for FileSystem and for coders as well).
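The ServiceLoader pattern mentioned above, in a minimal sketch (the registrar interface is hypothetical, not an actual Beam class): each module ships an implementation plus a META-INF/services/<interface name> resource naming it, and the core discovers all implementations on the classpath.

```java
import java.util.ServiceLoader;

// Hypothetical registrar interface a module would implement to register
// a schema for its types.
interface SchemaProviderRegistrar {
    String providedType(); // fully qualified name of the type it covers
}

class ServiceLoaderDemo {
    public static void main(String[] args) {
        // Core-side lookup: discovers every implementation declared in a
        // META-INF/services/SchemaProviderRegistrar resource on the
        // classpath. With no such resource present this loop finds nothing,
        // but the lookup code never changes as modules are added.
        for (SchemaProviderRegistrar registrar
                : ServiceLoader.load(SchemaProviderRegistrar.class)) {
            System.out.println("registering schema for " + registrar.providedType());
        }
    }
}
```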
>>>> >
>>>> >     BTW, right now I'm doing the conversion back and forth between
>>>> >     Row objects in the ByteBuddy-generated bytecode that we generate
>>>> >     in order to invoke DoFns.
>>>> >
>>>> > Reuven
>>>> >
>>>> > On Tue, May 22, 2018 at 10:04 PM Romain Manni-Bucau
>>>> > <rmannibucau@gmail.com <ma...@gmail.com>> wrote:
>>>> >
>>>> >         Hmm, the pluggability part is close to what I wanted to do
>>>> >         with JsonObject as a main API (to avoid redoing a "row" API
>>>> >         and a schema API).
>>>> >         Row.as(Class<T>) sounds good but then, does it mean we'll get
>>>> >         beam-sdk-java-row-jsonobject like modules (I'm not against,
>>>> >         just trying to understand here)?
>>>> >         If so, how can an IO use as() with the type it expects?
>>>> >         Doesn't it lead to having tons of these modules in the end?
>>>> >
>>>> > Romain Manni-Bucau
>>>> > @rmannibucau <https://twitter.com/rmannibucau> | Blog
>>>> > <https://rmannibucau.metawerx.net/> | Old Blog
>>>> > <http://rmannibucau.wordpress.com> | Github
>>>> > <https://github.com/rmannibucau> | LinkedIn
>>>> > <https://www.linkedin.com/in/rmannibucau> | Book
>>>> >         <https://www.packtpub.com/application-development/java-ee-8-high-performance>
>>>> >
>>>> >
>>>> >         On Wed, May 23, 2018 at 04:57, Reuven Lax <relax@google.com>
>>>> >         wrote:
>>>> >
>>>> >             By the way Romain, if you have specific scenarios in mind
>>>> >             I would love to hear them. I can try and guess what
>>>> >             exactly you would like to get out of schemas, but it would
>>>> >             work better if you gave me concrete scenarios that you
>>>> >             would like to work.
>>>> >
>>>> > Reuven
>>>> >
>>>> >             On Tue, May 22, 2018 at 7:45 PM Reuven Lax
>>>> >             <relax@google.com> wrote:
>>>> >
>>>> >                 Yeah, what I'm working on will help with IO.
>>>> >                 Basically if you register a function with
>>>> >                 SchemaRegistry that converts back and forth between a
>>>> >                 type (say JsonObject) and a Beam Row, then it is
>>>> >                 applied by the framework behind the scenes as part of
>>>> >                 DoFn invocation. Concrete example: let's say I have an
>>>> >                 IO that reads json objects:
>>>> >
>>>> >                 class MyJsonIORead
>>>> >                     extends PTransform<PBegin, PCollection<JsonObject>> {...}
>>>> >
>>>> >                 If you register a schema for this type (or you can
>>>> >                 also just set the schema directly on the output
>>>> >                 PCollection), then Beam knows how to convert back and
>>>> >                 forth between JsonObject and Row. So the next ParDo
>>>> >                 can look like
>>>> >
>>>> >                 p.apply(new MyJsonIORead())
>>>> >                  .apply(ParDo.of(new DoFn<JsonObject, T>() {
>>>> >                    @ProcessElement
>>>> >                    public void process(@Element Row row) {
>>>> >                    }
>>>> >                  }));
>>>> >
>>>> >                 And Beam will automatically convert JsonObject to a
>>>> >                 Row for processing (you aren't forced to do this of
>>>> >                 course - you can always ask for it as a JsonObject).
>>>> >
>>>> >                 The same is true for output. If you have a sink that
>>>> >                 takes in JsonObject but the transform before it
>>>> >                 produces Row objects (for instance - because the
>>>> >                 transform before it is Beam SQL), Beam can
>>>> >                 automatically convert Row back to JsonObject for you.
>>>> >
>>>> >                 All of this was detailed in the Schema doc I shared a
>>>> >                 few months ago. There was a lot of discussion on that
>>>> >                 document from various parties, and some of this API is
>>>> >                 a result of that discussion. This is also working in
>>>> >                 the branch JB and I were working on, though not yet
>>>> >                 integrated back to master.
>>>> >
>>>> >                 I would like to actually go further and make Row an
>>>> >                 interface and provide a way to automatically put a Row
>>>> >                 interface on top of any other object (e.g. JsonObject,
>>>> >                 Pojo, etc.) This won't change the way the user writes
>>>> >                 code, but instead of Beam having to copy and convert
>>>> >                 at each stage (e.g. from JsonObject to Row) it simply
>>>> >                 will create a Row object that uses the JsonObject as
>>>> >                 its underlying storage.
>>>> >
>>>> > Reuven
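The conversion flow described in the message above can be sketched in plain Java without any Beam dependency (all names here are hypothetical stand-ins, not the real SDK classes): a registry maps an element type to a to-Row function, and the "framework" applies it before handing the element to a process method that asked for a Row.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

class SchemaRegistrySketch {
    // Stand-in for Beam's Row: just a named-field map.
    static class Row {
        final Map<String, Object> fields;
        Row(Map<String, Object> fields) { this.fields = fields; }
    }

    static final Map<Class<?>, Function<Object, Row>> TO_ROW = new HashMap<>();

    static <T> void registerSchema(Class<T> type, Function<T, Row> toRow) {
        // Wrap so the stored function accepts Object and casts safely.
        TO_ROW.put(type, o -> toRow.apply(type.cast(o)));
    }

    // What the framework would do behind the scenes before invoking a
    // process method that declared a Row parameter.
    static Row convertElement(Object element) {
        Function<Object, Row> toRow = TO_ROW.get(element.getClass());
        if (toRow == null) {
            throw new IllegalStateException(
                "no schema registered for " + element.getClass());
        }
        return toRow.apply(element);
    }

    public static void main(String[] args) {
        // Pretend String is our JsonObject-like element type.
        registerSchema(String.class, json -> {
            Map<String, Object> f = new HashMap<>();
            f.put("raw", json);
            return new Row(f);
        });
        Row row = convertElement("{\"a\":1}");
        System.out.println(row.fields.get("raw"));
    }
}
```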
>>>> >
>>>> >                 On Tue, May 22, 2018 at 11:37 AM Romain Manni-Bucau
>>>> >                 <rmannibucau@gmail.com> wrote:
>>>> >
>>>> >                     Well, beam can implement a new mapper but it
>>>> >                     doesn't help for io. Most modern backends will
>>>> >                     take json directly, even the javax one, and it
>>>> >                     must stay generic.
>>>> >
>>>> >                     Then since json to pojo mapping has already been
>>>> >                     done a dozen times, not sure it is worth it for
>>>> >                     now.
>>>> >
>>>> >                     On Tue, May 22, 2018 at 20:27, Reuven Lax
>>>> >                     <relax@google.com> wrote:
>>>> >
>>>> >                         We can do even better btw. Building a
>>>> >                         SchemaRegistry where automatic conversions can
>>>> >                         be registered between schema and Java data
>>>> >                         types. With this the user won't even need a
>>>> >                         DoFn to do the conversion.
>>>> >
>>>> > On Tue, May 22, 2018, 10:13 AM Romain
>>>> > Manni-Bucau <rmannibucau@gmail.com
>>>> > <ma...@gmail.com>> wrote:
>>>> >
>>>> >                             Hi guys,
>>>> >
>>>> >                             Checked out what has been done on the
>>>> >                             schema model and think it is acceptable -
>>>> >                             regarding the json debate - if
>>>> >                             https://issues.apache.org/jira/browse/BEAM-4381
>>>> >                             can be fixed.
>>>> >
>>>> >                             High level, it is about providing a
>>>> >                             mainstream and not too impacting model
>>>> >                             OOTB, and JSON seems the most valid option
>>>> >                             for now, at least for IO and some user
>>>> >                             transforms.
>>>> >
>>>> >                             Wdyt?
>>>> >
>>>> >                             On Fri, Apr 27, 2018 at 18:36, Romain
>>>> >                             Manni-Bucau <rmannibucau@gmail.com> wrote:
>>>> >
>>>> >                                 Can give it a try end of May, sure.
>>>> >                                 (holidays and work constraints will
>>>> >                                 make it hard before).
>>>> >
>>>> >                                 On Apr 27, 2018 at 18:26, "Anton
>>>> >                                 Kedin" <kedin@google.com> wrote:
>>>> >
>>>> >                                     Romain,
>>>> >
>>>> >                                     I don't believe that the JSON
>>>> >                                     approach was investigated very
>>>> >                                     thoroughly. I mentioned a few
>>>> >                                     reasons which make it not the best
>>>> >                                     choice in my opinion, but I may be
>>>> >                                     wrong. Can you put together a
>>>> >                                     design doc or a prototype?
>>>> >
>>>> >                                     Thank you,
>>>> >                                     Anton
>>>> >
>>>> >
>>>> >                                     On Thu, Apr 26, 2018 at 10:17 PM
>>>> >                                     Romain Manni-Bucau
>>>> >                                     <rmannibucau@gmail.com> wrote:
>>>> >
>>>> >
>>>> >
>>>> >                                         On Apr 26, 2018 at 23:13,
>>>> >                                         "Anton Kedin"
>>>> >                                         <kedin@google.com> wrote:
>>>> >
>>>> >                                             BeamRecord (Row) has very
>>>> >                                             little in common with
>>>> >                                             JsonObject (I assume
>>>> >                                             you're talking about
>>>> >                                             javax.json), except maybe
>>>> >                                             some similarities of the
>>>> >                                             API. A few reasons why
>>>> >                                             JsonObject doesn't work:
>>>> >
>>>> >                                             * it is a Java EE API:
>>>> >                                               o Beam SDK is not
>>>> >                                                 limited to Java. There
>>>> >                                                 are probably similar
>>>> >                                                 APIs for other
>>>> >                                                 languages but they
>>>> >                                                 might not necessarily
>>>> >                                                 carry the same
>>>> >                                                 semantics / APIs;
>>>> >
>>>> >                                         Not a big deal I think. At
>>>> >                                         least not a technical blocker.
>>>> >
>>>> >                                               o It can change between
>>>> >                                                 Java versions;
>>>> >
>>>> >                                         No, this is javaee ;).
>>>> >
>>>> >                                               o Current Beam java
>>>> >                                                 implementation is an
>>>> >                                                 experimental feature
>>>> >                                                 to identify what's
>>>> >                                                 needed from such an
>>>> >                                                 API; in the end we
>>>> >                                                 might end up with
>>>> >                                                 something similar to
>>>> >                                                 the JsonObject API,
>>>> >                                                 but likely not;
>>>> >
>>>> >                                         I don't get that point as a
>>>> >                                         blocker.
>>>> >
>>>> >                                             * represents JSON, which
>>>> >                                               is not an API but an
>>>> >                                               object notation:
>>>> >                                               o it is defined as a
>>>> >                                                 unicode string in a
>>>> >                                                 certain format. If you
>>>> >                                                 choose to adhere to
>>>> >                                                 ECMA-404, then it
>>>> >                                                 doesn't sound like
>>>> >                                                 JsonObject can
>>>> >                                                 represent an Avro
>>>> >                                                 object, if I'm reading
>>>> >                                                 it right;
>>>> >
>>>> >                                         It is in the generator impl,
>>>> >                                         you can impl an avro
>>>> >                                         generator.
>>>> >
>>>> >                                             * doesn't define a type
>>>> >                                               system (JSON does, but
>>>> >                                               it's lacking):
>>>> >                                               o for example, JSON
>>>> >                                                 doesn't define
>>>> >                                                 semantics for numbers;
>>>> >                                               o doesn't define
>>>> >                                                 date/time types;
>>>> >                                               o doesn't allow
>>>> >                                                 extending the JSON
>>>> >                                                 type system at all;
>>>> >
>>>> >                                         That is why you need a
>>>> >                                         metadata object, or simpler, a
>>>> >                                         schema with that data. Json or
>>>> >                                         beam record doesn't help here
>>>> >                                         and you end up with the same
>>>> >                                         outcome if you think about it.
>>>> >
>>>> >                                             * lacks schemas;
>>>> >
>>>> >                                         JSON Schemas are standard,
>>>> >                                         widely spread and tooled
>>>> >                                         compared to the alternatives.
>>>> >
>>>> >                                             You can definitely try to
>>>> >                                             loosen the requirements
>>>> >                                             and define everything in
>>>> >                                             JSON in userland, but the
>>>> >                                             point of Row/Schema is to
>>>> >                                             avoid it and define
>>>> >                                             everything in the Beam
>>>> >                                             model, which can be
>>>> >                                             extended, mapped to JSON,
>>>> >                                             Avro, BigQuery Schemas,
>>>> >                                             custom binary format etc.,
>>>> >                                             with the same semantics
>>>> >                                             across beam SDKs.
>>>> >
>>>> >                                         This is what jsonp would
>>>> >                                         allow, with the benefit of
>>>> >                                         natural pojo support through
>>>> >                                         jsonb.
>>>> >
>>>> >
>>>> >
>>>> >                                         On Thu, Apr 26, 2018 at
>>>> >                                         12:28 PM Romain Manni-Bucau
>>>> >                                         <rmannibucau@gmail.com>
>>>> >                                         wrote:
>>>> >
>>>> >                                             Just to let it be clear
>>>> >                                             and let me understand: how
>>>> >                                             is BeamRecord different
>>>> >                                             from a JsonObject, which
>>>> >                                             is an API without
>>>> >                                             implementation (not even a
>>>> >                                             json one OOTB)?
>>>> >                                             Advantages of the json
>>>> >                                             *api* are indeed natural
>>>> >                                             mapping (jsonb is based on
>>>> >                                             jsonp so no new binding to
>>>> >                                             reinvent) and simple
>>>> >                                             serialization (json+gzip
>>>> >                                             for ex, or avro if you
>>>> >                                             want to be geeky).
>>>> >
>>>> >                                             I fail to see the point of
>>>> >                                             rebuilding an ecosystem
>>>> >                                             ATM.
>>>> >
>>>> >                                             On Apr 26, 2018 at 19:12,
>>>> >                                             "Reuven Lax"
>>>> >                                             <relax@google.com> wrote:
>>>> >
>>>> >                                                 Exactly what JB said.
>>>> >                                                 We will write a
>>>> >                                                 generic conversion
>>>> >                                                 from Avro (or json) to
>>>> >                                                 Beam schemas, which
>>>> >                                                 will make them work
>>>> >                                                 transparently with
>>>> >                                                 SQL. The plan is also
>>>> >                                                 to migrate Anton's
>>>> >                                                 work so that POJOs
>>>> >                                                 work generically for
>>>> >                                                 any schema.
>>>> >
>>>> >                                                 Reuven
>>>> >
>>>> >                                                 On Thu, Apr 26, 2018
>>>> >                                                 at 1:17 AM
>>>> >                                                 Jean-Baptiste Onofré
>>>> >                                                 <jb@nanthrax.net>
>>>> >                                                 wrote:
>>>> >
>>>> >                                                     For now we have a
>>>> >                                                     generic schema
>>>> >                                                     interface. Json-b
>>>> >                                                     can be an impl,
>>>> >                                                     avro could be
>>>> >                                                     another one.
>>>> >
>>>> >                                                     Regards
>>>> >                                                     JB
>>>> >
>>>> >                                                     On Apr 26, 2018 at
>>>> >                                                     12:08, Romain
>>>> >                                                     Manni-Bucau
>>>> >                                                     <rmannibucau@gmail.com>
>>>> >                                                     wrote:
>>>> >
>>>> >                                                         Hmm,
>>>> >
>>>> >                                                         avro still has
>>>> >                                                         the pitfall of
>>>> >                                                         having an
>>>> >                                                         uncontrolled
>>>> >                                                         stack which
>>>> >                                                         brings way too
>>>> >                                                         many
>>>> >                                                         dependencies
>>>> >                                                         to be part of
>>>> >                                                         any API; this
>>>> >                                                         is why I
>>>> >                                                         proposed a
>>>> >                                                         JSON-P based
>>>> >                                                         API
>>>> >                                                         (JsonObject)
>>>> >                                                         with a custom
>>>> >                                                         beam entry for
>>>> >                                                         some metadata
>>>> >                                                         (headers "à la
>>>> >                                                         Camel").
>>>> >
>>>> >
>>>> >                                                         Romain Manni-Bucau
>>>> >                                                         @rmannibucau <https://twitter.com/rmannibucau> | Blog
>>>> >                                                         <https://rmannibucau.metawerx.net/> | Old Blog
>>>> >                                                         <http://rmannibucau.wordpress.com> | Github
>>>> >                                                         <https://github.com/rmannibucau> | LinkedIn
>>>> >                                                         <https://www.linkedin.com/in/rmannibucau> | Book
>>>> >                                                         <https://www.packtpub.com/application-development/java-ee-8-high-performance>
>>>> >
>>>> >
>>>> >                                                         2018-04-26
>>>> >                                                         9:59 GMT+02:00
>>>> >                                                         Jean-Baptiste
>>>> >                                                         Onofré
>>>> >                                                         <jb@nanthrax.net>:
>>>> >
>>>> >                                                             Hi Ismael,
>>>> >
>>>> >                                                             You mean
>>>> >                                                             directly
>>>> >                                                             in Beam
>>>> >                                                             SQL?
>>>> >
>>>> >                                                             That will
>>>> >                                                             be part of
>>>> >                                                             schema
>>>> >                                                             support:
>>>> >                                                             generic
>>>> >                                                             record
>>>> >                                                             could be
>>>> >                                                             one of the
>>>> >                                                             payloads
>>>> >                                                             with a
>>>> >                                                             schema.
>>>> >
>>>> >                                                             Regards
>>>> >                                                             JB
>>>> >
>>>> >                                                             On Apr 26,
>>>> >                                                             2018 at
>>>> >                                                             11:39,
>>>> >                                                             "Ismaël
>>>> >                                                             Mejía"
>>>> >                                                             <iemejia@gmail.com>
>>>> >                                                             wrote:
>>>> >
>>>> >
>>>> Hello Anton,
>>>>
>>>> Thanks for the descriptive email and the really useful work. Any plans
>>>> to tackle PCollections of GenericRecord/IndexedRecords? It seems Avro
>>>> is a natural fit for this approach too.
>>>>
>>>> Regards,
>>>> Ismaël
>>>>
>>>> On Wed, Apr 25, 2018 at 9:04 PM, Anton Kedin <kedin@google.com> wrote:
>>>> >
>>>> > Hi,
>>>> >
>>>> > I want to highlight a couple of improvements to Beam SQL we have
>>>> > been working on recently which are targeted to make the Beam SQL
>>>> > API easier to use. Specifically these features simplify conversion
>>>> > of Java Beans and JSON strings to Rows.
>>>> >
>>>> > Feel free to try this and send any bugs/comments/PRs my way.
>>>> >
>>>> > **Caveat: this is still work in progress, and has known bugs and
>>>> > incomplete features, see below for details.**
>>>> >
>>>> > Background
>>>> >
>>>> > Beam SQL queries can only be applied to PCollection<Row>. This
>>>> > means that users need to convert whatever PCollection elements
>>>> > they have to Rows before querying them with SQL. This usually
>>>> > requires
>>>
>>>
Re: Beam SQL Improvements
Posted by Romain Manni-Bucau <rm...@gmail.com>.
This can create other issues with IO if the runner is not designed for it
(like the direct runner), so probably not something reliable for the beam
generic part :(.
On Mon, Jun 4, 2018 at 20:10, Lukasz Cwik <lc...@google.com> wrote:
> Shouldn't the runner isolate each instance of the pipeline behind an
> appropriate class loader?
>
> On Sun, Jun 3, 2018 at 12:45 PM Reuven Lax <re...@google.com> wrote:
>
>> Just an update: Romain and I chatted on Slack, and I think I understand
>> his concern. The concern wasn't specifically about schemas, rather about
>> having a generic way to register per-ParDo state that has worker lifetime.
>> As evidence that such is needed, in many cases static variables are used to
>> simiulate that. static variables however have downsides - if two pipelines
>> are run on the same JVM (happens often with unit tests, and there's nothing
>> that prevents a runner from doing so in a production environment), these
>> static variables will interfere with each other.
>>
>> On Thu, May 24, 2018 at 12:30 AM Reuven Lax <re...@google.com> wrote:
>>
>>> Romain, maybe it would be useful for us to find some time on slack. I'd
>>> like to understand your concerns. Also keep in mind that I'm tagging all
>>> these classes as Experimental for now, so we can definitely change these
>>> interfaces around if we decide they are not the best ones.
>>>
>>> Reuven
>>>
>>> On Tue, May 22, 2018 at 11:35 PM Romain Manni-Bucau <
>>> rmannibucau@gmail.com> wrote:
>>>
>>>> Why not extending ProcessContext to add the new remapped output? But
>>>> looks good (the part i dont like is that creating a new context each time a
>>>> new feature is added is hurting users. What when beam will add some
>>>> reactive support? ReactiveOutputReceiver?)
>>>>
>>>> Pipeline sounds the wrong storage since once distributed you serialized
>>>> the instances so kind of broke the lifecycle of the original instance and
>>>> have no real release/close hook on them anymore right? Not sure we can do
>>>> better than dofn/source embedded instances today.
>>>>
>>>>
>>>>
>>>>
>>>> Le mer. 23 mai 2018 08:02, Romain Manni-Bucau <rm...@gmail.com>
>>>> a écrit :
>>>>
>>>>>
>>>>>
>>>>> Le mer. 23 mai 2018 07:55, Jean-Baptiste Onofré <jb...@nanthrax.net> a
>>>>> écrit :
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> IMHO, it would be better to have a explicit transform/IO as converter.
>>>>>>
>>>>>> It would be easier for users.
>>>>>>
>>>>>> Another option would be to use a "TypeConverter/SchemaConverter" map
>>>>>> as
>>>>>> we do in Camel: Beam could check the source/destination "type" and
>>>>>> check
>>>>>> in the map if there's a converter available. This map can be store as
>>>>>> part of the pipeline (as we do for filesystem registration).
>>>>>>
>>>>>
>>>>>
>>>>> It works in camel because it is not strongly typed, isnt it? So can
>>>>> require a beam new pipeline api.
>>>>>
>>>>> +1 for the explicit transform, if added to the pipeline api as coder
>>>>> it wouldnt break the fluent api:
>>>>>
>>>>> p.apply(io).setOutputType(Foo.class)
>>>>>
>>>>> Coders can be a workaround since they owns the type but since the
>>>>> pcollection is the real owner it is surely saner this way, no?
>>>>>
>>>>> Also it needs to ensure all converters are present before running the
>>>>> pipeline probably, no implicit environment converter support is probably
>>>>> good to start to avoid late surprises.
>>>>>
>>>>>
>>>>>
>>>>>> My $0.01
>>>>>>
>>>>>> Regards
>>>>>> JB
>>>>>>
>>>>>> On 23/05/2018 07:51, Romain Manni-Bucau wrote:
>>>>>> > How does it work on the pipeline side?
>>>>>> > Do you generate these "virtual" IO at build time to enable the
>>>>>> fluent
>>>>>> > API to work not erasing generics?
>>>>>> >
>>>>>> > ex: SQL(row)->BigQuery(native) will not compile so we need a
>>>>>> > SQL(row)->BigQuery(row)
>>>>>> >
>>>>>> > Side note unrelated to Row: if you add another registry maybe a
>>>>>> pretask
>>>>>> > is to ensure beam has a kind of singleton/context to avoid to
>>>>>> duplicate
>>>>>> > it or not track it properly. These kind of converters will need a
>>>>>> global
>>>>>> > close and not only per record in general:
>>>>>> > converter.init();converter.convert(row);....converter.destroy();,
>>>>>> > otherwise it easily leaks. This is why it can require some way to
>>>>>> not
>>>>>> > recreate it. A quick fix, if you are in bytebuddy already, can be
>>>>>> to add
>>>>>> > it to setup/teardown pby, being more global would be nicer but is
>>>>>> more
>>>>>> > challenging.
>>>>>> >
>>>>>> > Romain Manni-Bucau
>>>>>> > @rmannibucau <https://twitter.com/rmannibucau> | Blog
>>>>>> > <https://rmannibucau.metawerx.net/> | Old Blog
>>>>>> > <http://rmannibucau.wordpress.com> | Github
>>>>>> > <https://github.com/rmannibucau> | LinkedIn
>>>>>> > <https://www.linkedin.com/in/rmannibucau> | Book
>>>>>> > <
>>>>>> https://www.packtpub.com/application-development/java-ee-8-high-performance
>>>>>> >
>>>>>> >
>>>>>> >
>>>>>> > Le mer. 23 mai 2018 à 07:22, Reuven Lax <relax@google.com
>>>>>> > <ma...@google.com>> a écrit :
>>>>>> >
>>>>>> > No - the only modules we need to add to core are the ones we
>>>>>> choose
>>>>>> > to add. For example, I will probably add a registration for
>>>>>> > TableRow/TableSchema (GCP BigQuery) so these can work seamlessly
>>>>>> > with schemas. However I will add that to the GCP module, so only
>>>>>> > someone depending on that module need to pull in that
>>>>>> dependency.
>>>>>> > The Java ServiceLoader framework can be used by these modules to
>>>>>> > register schemas for their types (we already do something
>>>>>> similar
>>>>>> > for FileSystem and for coders as well).
>>>>>> >
>>>>>> > BTW, right now the conversion back and forth between Row
>>>>>> objects I'm
>>>>>> > doing in the ByteBuddy generated bytecode that we generate in
>>>>>> order
>>>>>> > to invoke DoFns.
>>>>>> >
>>>>>> > Reuven
>>>>>> >
>>>>>> > On Tue, May 22, 2018 at 10:04 PM Romain Manni-Bucau
>>>>>> > <rmannibucau@gmail.com <ma...@gmail.com>> wrote:
>>>>>> >
>>>>>> > Hmm, the pluggability part is close to what I wanted to do
>>>>>> with
>>>>>> > JsonObject as a main API (to avoid to redo a "row" API and
>>>>>> > schema API)
>>>>>> > Row.as(Class<T>) sounds good but then, does it mean we'll
>>>>>> get
>>>>>> > beam-sdk-java-row-jsonobject like modules (I'm not against,
>>>>>> just
>>>>>> > trying to understand here)?
>>>>>> > If so, how an IO can use as() with the type it expects?
>>>>>> Doesnt
>>>>>> > it lead to have a tons of these modules at the end?
>>>>>> >
>>>>>> > Romain Manni-Bucau
>>>>>> > @rmannibucau <https://twitter.com/rmannibucau> | Blog
>>>>>> > <https://rmannibucau.metawerx.net/> | Old Blog
>>>>>> > <http://rmannibucau.wordpress.com> | Github
>>>>>> > <https://github.com/rmannibucau> | LinkedIn
>>>>>> > <https://www.linkedin.com/in/rmannibucau> | Book
>>>>>> > <
>>>>>> https://www.packtpub.com/application-development/java-ee-8-high-performance
>>>>>> >
>>>>>> >
>>>>>> >
>>>>>> > Le mer. 23 mai 2018 à 04:57, Reuven Lax <relax@google.com
>>>>>> > <ma...@google.com>> a écrit :
>>>>>> >
>>>>>> > By the way Romain, if you have specific scenarios in
>>>>>> mind I
>>>>>> > would love to hear them. I can try and guess what
>>>>>> exactly
>>>>>> > you would like to get out of schemas, but it would work
>>>>>> > better if you gave me concrete scenarios that you would
>>>>>> like
>>>>>> > to work.
>>>>>> >
>>>>>> > Reuven
>>>>>> >
>>>>>> > On Tue, May 22, 2018 at 7:45 PM Reuven Lax <
>>>>>> relax@google.com
>>>>>> > <ma...@google.com>> wrote:
>>>>>> >
>>>>>> > Yeah, what I'm working on will help with IO.
>>>>>> Basically
>>>>>> > if you register a function with SchemaRegistry that
>>>>>> > converts back and forth between a type (say
>>>>>> JsonObject)
>>>>>> > and a Beam Row, then it is applied by the framework
>>>>>> > behind the scenes as part of DoFn invocation.
>>>>>> Concrete
>>>>>> > example: let's say I have an IO that reads json
>>>>>> objects
>>>>>> > class MyJsonIORead extends PTransform<PBegin,
>>>>>> > JsonObject> {...}
>>>>>> >
>>>>>> > If you register a schema for this type (or you can
>>>>>> also
>>>>>> > just set the schema directly on the output
>>>>>> PCollection),
>>>>>> > then Beam knows how to convert back and forth
>>>>>> between
>>>>>> > JsonObject and Row. So the next ParDo can look like
>>>>>> >
>>>>>> > p.apply(new MyJsonIORead())
>>>>>> > .apply(ParDo.of(new DoFn<JsonObject, T>....
>>>>>> > @ProcessElement void process(@Element Row row) {
>>>>>> > })
>>>>>> >
>>>>>> > And Beam will automatically convert JsonObject to a
>>>>>> Row
>>>>>> > for processing (you aren't forced to do this of
>>>>>> course -
>>>>>> > you can always ask for it as a JsonObject).
>>>>>> >
>>>>>> > The same is true for output. If you have a sink that
>>>>>> > takes in JsonObject but the transform before it
>>>>>> produces
>>>>>> > Row objects (for instance - because the transform
>>>>>> before
>>>>>> > it is Beam SQL), Beam can automatically convert Row
>>>>>> back
>>>>>> > to JsonObject for you.
>>>>>> >
>>>>>> > All of this was detailed in the Schema doc I shared
>>>>>> a
>>>>>> > few months ago. There was a lot of discussion on
>>>>>> that
>>>>>> > document from various parties, and some of this API
>>>>>> is a
>>>>>> > result of that discussion. This is also working in
>>>>>> the
>>>>>> > branch JB and I were working on, though not yet
>>>>>> > integrated back to master.
>>>>>> >
>>>>>> > I would like to actually go further and make Row an
>>>>>> > interface and provide a way to automatically put a
>>>>>> Row
>>>>>> > interface on top of any other object (e.g.
>>>>>> JsonObject,
>>>>>> > Pojo, etc.) This won't change the way the user
>>>>>> writes
>>>>>> > code, but instead of Beam having to copy and
>>>>>> convert at
>>>>>> > each stage (e.g. from JsonObject to Row) it simply
>>>>>> will
>>>>>> > create a Row object that uses the the JsonObject as
>>>>>> its
>>>>>> > underlying storage.
>>>>>> >
>>>>>> > Reuven
>>>>>> >
>>>>>> > On Tue, May 22, 2018 at 11:37 AM Romain Manni-Bucau
>>>>>> > <rmannibucau@gmail.com <mailto:
>>>>>> rmannibucau@gmail.com>>
>>>>>> > wrote:
>>>>>> >
>>>>>> > Well, beam can implement a new mapper but it
>>>>>> doesnt
>>>>>> > help for io. Most of modern backends will take
>>>>>> json
>>>>>> > directly, even javax one and it must stay
>>>>>> generic.
>>>>>> >
>>>>>> > Then since json to pojo mapping is already done
>>>>>> a
>>>>>> > dozen of times, not sure it is worth it for now.
>>>>>> >
>>>>>> > Le mar. 22 mai 2018 20:27, Reuven Lax
>>>>>> > <relax@google.com <ma...@google.com>> a
>>>>>> écrit :
>>>>>> >
>>>>>> > We can do even better btw. Building a
>>>>>> > SchemaRegistry where automatic conversions
>>>>>> can
>>>>>> > be registered between schema and Java data
>>>>>> > types. With this the user won't even need a
>>>>>> DoFn
>>>>>> > to do the conversion.
>>>>>> >
>>>>>> > On Tue, May 22, 2018, 10:13 AM Romain
>>>>>> > Manni-Bucau <rmannibucau@gmail.com
>>>>>> > <ma...@gmail.com>> wrote:
>>>>>> >
>>>>>> > Hi guys,
>>>>>> >
>>>>>> > Checked out what has been done on schema
>>>>>> > model and think it is acceptable -
>>>>>> regarding
>>>>>> > the json debate -
>>>>>> > if
>>>>>> https://issues.apache.org/jira/browse/BEAM-4381
>>>>>> > can be fixed.
>>>>>> >
>>>>>> > High level, it is about providing a
>>>>>> > mainstream and not too impacting model
>>>>>> OOTB
>>>>>> > and JSON seems the most valid option for
>>>>>> > now, at least for IO and some user
>>>>>> transforms.
>>>>>> >
>>>>>> > Wdyt?
>>>>>> >
>>>>>> >             On Fri, Apr 27, 2018 at 18:36, Romain Manni-Bucau
>>>>>> >             <rmannibucau@gmail.com> wrote:
>>>>>> >
>>>>>> >                 Can give it a try end of May, sure. (Holidays
>>>>>> >                 and work constraints will make it hard before.)
>>>>>> >
>>>>>> >                 On Apr 27, 2018 at 18:26, "Anton Kedin"
>>>>>> >                 <kedin@google.com> wrote:
>>>>>> >
>>>>>> >                     Romain,
>>>>>> >
>>>>>> >                     I don't believe that the JSON approach was
>>>>>> >                     investigated very thoroughly. I mentioned a
>>>>>> >                     few reasons which make it not the best
>>>>>> >                     choice in my opinion, but I may be wrong.
>>>>>> >                     Can you put together a design doc or a
>>>>>> >                     prototype?
>>>>>> >
>>>>>> >                     Thank you,
>>>>>> >                     Anton
>>>>>> >
>>>>>> >
>>>>>> > On Thu, Apr 26, 2018 at 10:17 PM Romain Manni-Bucau
>>>>>> > <rmannibucau@gmail.com> wrote:
>>>>>> >
>>>>>> >     On Apr 26, 2018 at 23:13, "Anton Kedin" <kedin@google.com>
>>>>>> >     wrote:
>>>>>> >
>>>>>> >         BeamRecord (Row) has very little in common with
>>>>>> >         JsonObject (I assume you're talking about javax.json),
>>>>>> >         except maybe some similarities of the API. A few
>>>>>> >         reasons why JsonObject doesn't work:
>>>>>> >
>>>>>> >         * it is a Java EE API:
>>>>>> >           o Beam SDK is not limited to Java. There are probably
>>>>>> >             similar APIs for other languages but they might not
>>>>>> >             necessarily carry the same semantics / APIs;
>>>>>> >
>>>>>> >     Not a big deal I think. At least not a technical blocker.
>>>>>> >
>>>>>> >           o It can change between Java versions;
>>>>>> >
>>>>>> >     No, this is javaee ;).
>>>>>> >
>>>>>> >           o Current Beam java implementation is an experimental
>>>>>> >             feature to identify what's needed from such an API;
>>>>>> >             in the end we might end up with something similar
>>>>>> >             to the JsonObject API, but likely not;
>>>>>> >
>>>>>> >     I don't get that point as a blocker.
>>>>>> >
>>>>>> >         * represents JSON, which is not an API but an object
>>>>>> >           notation:
>>>>>> >           o it is defined as a unicode string in a certain
>>>>>> >             format. If you choose to adhere to ECMA-404, then
>>>>>> >             it doesn't sound like JsonObject can represent an
>>>>>> >             Avro object, if I'm reading it right;
>>>>>> >
>>>>>> >     It is in the generator impl, you can impl an avro
>>>>>> >     generator.
>>>>>> >
>>>>>> >         * doesn't define a type system (JSON does, but it's
>>>>>> >           lacking):
>>>>>> >           o for example, JSON doesn't define semantics for
>>>>>> >             numbers;
>>>>>> >           o doesn't define date/time types;
>>>>>> >           o doesn't allow extending the JSON type system at
>>>>>> >             all;
>>>>>> >
>>>>>> >     That is why you need a metadata object, or simpler, a
>>>>>> >     schema with that data. Json or beam record doesn't help
>>>>>> >     here and you end up with the same outcome if you think
>>>>>> >     about it.
>>>>>> >
>>>>>> >         * lacks schemas;
>>>>>> >
>>>>>> >     JSON Schemas are standard, widely spread and tooled
>>>>>> >     compared to the alternative.
>>>>>> >
>>>>>> >         You can definitely try to loosen the requirements and
>>>>>> >         define everything in JSON in userland, but the point of
>>>>>> >         Row/Schema is to avoid it and define everything in the
>>>>>> >         Beam model, which can be extended, mapped to JSON,
>>>>>> >         Avro, BigQuery Schemas, custom binary formats etc.,
>>>>>> >         with the same semantics across beam SDKs.
>>>>>> >
>>>>>> >     This is what jsonp would allow, with the benefit of
>>>>>> >     natural pojo support through jsonb.
>>>>>> >
>>>>>> >
>>>>>> >
>>>>>> > On Thu, Apr 26, 2018 at 12:28 PM Romain Manni-Bucau
>>>>>> > <rmannibucau@gmail.com> wrote:
>>>>>> >
>>>>>> >     Just to let it be clear and let me understand: how is
>>>>>> >     BeamRecord different from a JsonObject, which is an API
>>>>>> >     without implementation (not even a json one OOTB)?
>>>>>> >     Advantages of the json *api* are indeed natural mapping
>>>>>> >     (jsonb is based on jsonp so no new binding to reinvent)
>>>>>> >     and simple serialization (json+gzip for ex, or avro if you
>>>>>> >     want to be geeky).
>>>>>> >
>>>>>> >     I fail to see the point of rebuilding an ecosystem ATM.
>>>>>> >
>>>>>> >     On Apr 26, 2018 at 19:12, "Reuven Lax" <relax@google.com>
>>>>>> >     wrote:
>>>>>> >
>>>>>> >         Exactly what JB said. We will write a generic
>>>>>> >         conversion from Avro (or json) to Beam schemas, which
>>>>>> >         will make them work transparently with SQL. The plan
>>>>>> >         is also to migrate Anton's work so that POJOs work
>>>>>> >         generically for any schema.
>>>>>> >
>>>>>> >         Reuven
>>>>>> >
>>>>>> >         On Thu, Apr 26, 2018 at 1:17 AM Jean-Baptiste Onofré
>>>>>> >         <jb@nanthrax.net> wrote:
>>>>>> >
>>>>>> >             For now we have a generic schema interface. Json-b
>>>>>> >             can be an impl, avro could be another one.
>>>>>> >
>>>>>> >             Regards
>>>>>> >             JB
>>>>>> >             On Apr 26, 2018, at 12:08, Romain Manni-Bucau
>>>>>> >             <rmannibucau@gmail.com> wrote:
>>>>>> >
>>>>>> >                 Hmm,
>>>>>> >
>>>>>> >                 avro still has the pitfall of an uncontrolled
>>>>>> >                 stack which brings way too many dependencies
>>>>>> >                 to be part of any API; this is why I proposed
>>>>>> >                 a JSON-P based API (JsonObject) with a custom
>>>>>> >                 beam entry for some metadata (headers "à la
>>>>>> >                 Camel").
>>>>>> >
>>>>>> >                 Romain Manni-Bucau
>>>>>> >                 @rmannibucau <https://twitter.com/rmannibucau> |
>>>>>> >                 Blog <https://rmannibucau.metawerx.net/> |
>>>>>> >                 Old Blog <http://rmannibucau.wordpress.com> |
>>>>>> >                 Github <https://github.com/rmannibucau> |
>>>>>> >                 LinkedIn <https://www.linkedin.com/in/rmannibucau> |
>>>>>> >                 Book <https://www.packtpub.com/application-development/java-ee-8-high-performance>
>>>>>> >
>>>>>> >
>>>>>> >
>>>>>> >
>>>>>> >                 2018-04-26 9:59 GMT+02:00 Jean-Baptiste Onofré
>>>>>> >                 <jb@nanthrax.net>:
>>>>>> >
>>>>>> >                     Hi Ismael,
>>>>>> >
>>>>>> >                     You mean directly in Beam SQL?
>>>>>> >
>>>>>> >                     That will be part of schema support:
>>>>>> >                     generic record could be one of the payload
>>>>>> >                     with across schema.
>>>>>> >
>>>>>> >                     Regards
>>>>>> >                     JB
>>>>>> >
>>>>>> >                     On Apr 26, 2018, at 11:39, "Ismaël Mejía"
>>>>>> >                     <iemejia@gmail.com> wrote:
>>>>>> >
>>>>>> >
>>>>>> >                         Hello Anton,
>>>>>> >
>>>>>> >                         Thanks for the descriptive email and
>>>>>> >                         the really useful work. Any plans to
>>>>>> >                         tackle PCollections of
>>>>>> >                         GenericRecord/IndexedRecords? It seems
>>>>>> >                         Avro is a natural fit for this
>>>>>> >                         approach too.
>>>>>> >
>>>>>> >                         Regards,
>>>>>> >                         Ismaël
>>>>>> >
>>>>>> >                         On Wed, Apr 25, 2018 at 9:04 PM, Anton
>>>>>> >                         Kedin <kedin@google.com> wrote:
>>>>>> >
>>>>>> >
>>>>>>
>>>>>> >
>>>>>> >
>>>>>> >                             Hi,
>>>>>> >
>>>>>> >                             I want to highlight a couple of
>>>>>> >                             improvements to Beam SQL we have
>>>>>> >                             been working on recently which are
>>>>>> >                             targeted to make the Beam SQL API
>>>>>> >                             easier to use. Specifically these
>>>>>> >                             features simplify conversion of
>>>>>> >                             Java Beans and JSON strings to
>>>>>> >                             Rows.
>>>>>> >
>>>>>> >                             Feel free to try this and send any
>>>>>> >                             bugs/comments/PRs my way.
>>>>>> >
>>>>>> >                             **Caveat: this is still work in
>>>>>> >                             progress, and has known bugs and
>>>>>> >                             incomplete features, see below for
>>>>>> >                             details.**
>>>>>> >
>>>>>> >                             Background
>>>>>> >
>>>>>> >                             Beam SQL queries can only be
>>>>>> >                             applied to PCollection<Row>. This
>>>>>> >                             means that users need to convert
>>>>>> >                             whatever PCollection elements they
>>>>>> >                             have to Rows before querying them
>>>>>> >                             with SQL. This usually requires
>>>>>>
>>>>>
>>>>>
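The Java Bean to Row conversion Anton announces can be approximated with a reflective getter walk like the one below. This is a sketch under assumptions: `BeanToRow` is a hypothetical helper, a `Map` stands in for Beam's `Row`, and the real implementation differs (it is schema-driven and generates bytecode).

```java
import java.lang.reflect.Method;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: walks public getters reflectively and collects their
// values into a Map standing in for a Row. Beam's real converter differs.
final class BeanToRow {
    static Map<String, Object> toRow(Object bean) {
        Map<String, Object> row = new HashMap<>();
        for (Method m : bean.getClass().getMethods()) {
            if (m.getName().startsWith("get")
                    && m.getName().length() > 3
                    && m.getParameterCount() == 0
                    && !m.getName().equals("getClass")) {
                // Field name: getFoo -> foo
                String field = Character.toLowerCase(m.getName().charAt(3))
                        + m.getName().substring(4);
                try {
                    row.put(field, m.invoke(bean));
                } catch (ReflectiveOperationException e) {
                    // Skip unreadable properties in this sketch.
                }
            }
        }
        return row;
    }
}
```

Reflection keeps the sketch short; a production converter would cache the schema per class instead of re-scanning methods per element.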
Re: Beam SQL Improvements
Posted by Reuven Lax <re...@google.com>.
Does DirectRunner do this today?
On Mon, Jun 4, 2018 at 9:10 PM Lukasz Cwik <lc...@google.com> wrote:
> Shouldn't the runner isolate each instance of the pipeline behind an
> appropriate class loader?
>
> On Sun, Jun 3, 2018 at 12:45 PM Reuven Lax <re...@google.com> wrote:
>
>> Just an update: Romain and I chatted on Slack, and I think I understand
>> his concern. The concern wasn't specifically about schemas, rather about
>> having a generic way to register per-ParDo state that has worker lifetime.
>> As evidence that such is needed, in many cases static variables are used to
>> simulate that. Static variables, however, have downsides - if two pipelines
>> are run on the same JVM (happens often with unit tests, and there's nothing
>> that prevents a runner from doing so in a production environment), these
>> static variables will interfere with each other.
>>
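The interference described above is easy to reproduce in miniature. The sketch below is a deliberately simplified stand-in (not the real Beam API): two pipeline runs in one JVM share the same static field, so each run observes the other's state.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-in for per-ParDo worker state (not the real Beam API).
// The static field is shared by every pipeline this class loader sees,
// so state from one pipeline run leaks into the next.
final class StaticStateHazard {
    // Intended as "my pipeline's cache", actually shared JVM-wide.
    static final List<String> CACHE = new ArrayList<>();

    static int runPipeline(String name) {
        CACHE.add(name);     // this run "registers" its state
        return CACHE.size(); // and assumes the cache holds only its own entry
    }
}
```

Running two pipelines in one JVM, as unit tests routinely do, shows the problem: the second run sees the first run's entry. A per-instance worker-lifetime hook, which is what the generic registration discussed here would provide, avoids this.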
>> On Thu, May 24, 2018 at 12:30 AM Reuven Lax <re...@google.com> wrote:
>>
>>> Romain, maybe it would be useful for us to find some time on slack. I'd
>>> like to understand your concerns. Also keep in mind that I'm tagging all
>>> these classes as Experimental for now, so we can definitely change these
>>> interfaces around if we decide they are not the best ones.
>>>
>>> Reuven
>>>
>>> On Tue, May 22, 2018 at 11:35 PM Romain Manni-Bucau <
>>> rmannibucau@gmail.com> wrote:
>>>
>>>> Why not extending ProcessContext to add the new remapped output? But
>>>> looks good (the part i dont like is that creating a new context each time a
>>>> new feature is added is hurting users. What when beam will add some
>>>> reactive support? ReactiveOutputReceiver?)
>>>>
>>>> Pipeline sounds the wrong storage since once distributed you serialized
>>>> the instances so kind of broke the lifecycle of the original instance and
>>>> have no real release/close hook on them anymore right? Not sure we can do
>>>> better than dofn/source embedded instances today.
>>>>
>>>>
>>>>
>>>>
>>>> Le mer. 23 mai 2018 08:02, Romain Manni-Bucau <rm...@gmail.com>
>>>> a écrit :
>>>>
>>>>>
>>>>>
>>>>> Le mer. 23 mai 2018 07:55, Jean-Baptiste Onofré <jb...@nanthrax.net> a
>>>>> écrit :
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> IMHO, it would be better to have an explicit transform/IO as converter.
>>>>>>
>>>>>> It would be easier for users.
>>>>>>
>>>>>> Another option would be to use a "TypeConverter/SchemaConverter"
>>>>>> map as we do in Camel: Beam could check the source/destination
>>>>>> "type" and check in the map if there's a converter available.
>>>>>> This map can be stored as part of the pipeline (as we do for
>>>>>> filesystem registration).
>>>>>>
>>>>>
>>>>>
>>>>> It works in camel because it is not strongly typed, isn't it? So it
>>>>> can require a new beam pipeline api.
>>>>>
>>>>> +1 for the explicit transform, if added to the pipeline api as coder
>>>>> it wouldnt break the fluent api:
>>>>>
>>>>> p.apply(io).setOutputType(Foo.class)
>>>>>
>>>>> Coders can be a workaround since they owns the type but since the
>>>>> pcollection is the real owner it is surely saner this way, no?
>>>>>
>>>>> Also it needs to ensure all converters are present before running the
>>>>> pipeline probably, no implicit environment converter support is probably
>>>>> good to start to avoid late surprises.
>>>>>
>>>>>
>>>>>
>>>>>> My $0.01
>>>>>>
>>>>>> Regards
>>>>>> JB
>>>>>>
>>>>>> On 23/05/2018 07:51, Romain Manni-Bucau wrote:
>>>>>> > How does it work on the pipeline side?
>>>>>> > Do you generate these "virtual" IO at build time to enable the
>>>>>> > fluent API to work not erasing generics?
>>>>>> >
>>>>>> > ex: SQL(row)->BigQuery(native) will not compile so we need a
>>>>>> > SQL(row)->BigQuery(row)
>>>>>> >
>>>>>> > Side note unrelated to Row: if you add another registry, maybe
>>>>>> > a pretask is to ensure beam has a kind of singleton/context to
>>>>>> > avoid duplicating it or not tracking it properly. These kinds
>>>>>> > of converters will need a global close and not only per record
>>>>>> > in general:
>>>>>> > converter.init();converter.convert(row);....converter.destroy();,
>>>>>> > otherwise it easily leaks. This is why it can require some way
>>>>>> > to not recreate it. A quick fix, if you are in bytebuddy
>>>>>> > already, can be to add it to setup/teardown pby; being more
>>>>>> > global would be nicer but is more challenging.
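The init/convert/destroy lifecycle sketched in that side note can be written out as follows; `RowConverter` and its behavior are illustrative assumptions, not a Beam API. The point is that setup and teardown run once, not once per record.

```java
// Illustrative converter with an explicit lifecycle (names are assumptions,
// not a Beam API): init once, convert per record, destroy once - instead of
// re-creating the converter for every element, which leaks resources.
final class RowConverter {
    private boolean initialized;

    void init() {              // e.g. open a mapper or client once
        initialized = true;
    }

    String convert(String row) {
        if (!initialized) {
            throw new IllegalStateException("init() was not called");
        }
        return "converted:" + row; // placeholder per-record work
    }

    void destroy() {           // release resources exactly once
        initialized = false;
    }
}
```

Hooking `init`/`destroy` into a DoFn's setup/teardown, as the message suggests, keeps the per-record path allocation-free.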
>>>>>> >
>>>>>> > Romain Manni-Bucau
>>>>>> > @rmannibucau <https://twitter.com/rmannibucau> | Blog
>>>>>> > <https://rmannibucau.metawerx.net/> | Old Blog
>>>>>> > <http://rmannibucau.wordpress.com> | Github
>>>>>> > <https://github.com/rmannibucau> | LinkedIn
>>>>>> > <https://www.linkedin.com/in/rmannibucau> | Book
>>>>>> > <
>>>>>> https://www.packtpub.com/application-development/java-ee-8-high-performance
>>>>>> >
>>>>>> >
>>>>>> >
>>>>>> > On Wed, May 23, 2018 at 07:22, Reuven Lax <relax@google.com>
>>>>>> > wrote:
>>>>>> >
>>>>>> > No - the only modules we need to add to core are the ones we
>>>>>> > choose to add. For example, I will probably add a registration
>>>>>> > for TableRow/TableSchema (GCP BigQuery) so these can work
>>>>>> > seamlessly with schemas. However I will add that to the GCP
>>>>>> > module, so only someone depending on that module need pull in
>>>>>> > that dependency. The Java ServiceLoader framework can be used
>>>>>> > by these modules to register schemas for their types (we
>>>>>> > already do something similar for FileSystem and for coders as
>>>>>> > well).
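The ServiceLoader-based registration mentioned above can be sketched like this. `SchemaProvider` and `SchemaRegistry` are illustrative names, not Beam's actual SPI; a module would ship a `META-INF/services` entry naming its implementation so the registry discovers it at load time.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.ServiceLoader;

// Illustrative SPI (not Beam's actual interface): a module implements this
// for the types it owns, e.g. a GCP module for TableRow.
interface SchemaProvider {
    Class<?> type();                          // user type this provider handles
    Map<String, Object> toRow(Object value);  // simplified "Row" as a Map
}

final class SchemaRegistry {
    private final Map<Class<?>, SchemaProvider> providers = new HashMap<>();

    // Discover providers on the classpath the same way FileSystems and
    // coders are discovered: each module ships a META-INF/services entry
    // naming its SchemaProvider implementation.
    static SchemaRegistry loadFromClasspath() {
        SchemaRegistry registry = new SchemaRegistry();
        for (SchemaProvider p : ServiceLoader.load(SchemaProvider.class)) {
            registry.providers.put(p.type(), p);
        }
        return registry;
    }

    SchemaProvider providerFor(Class<?> type) {
        return providers.get(type); // null when no module registered one
    }
}
```

With this shape, core never depends on the GCP module: the dependency arrow points the other way, exactly as described for coders and FileSystem.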
>>>>>> >
>>>>>> > BTW, right now the conversion back and forth between Row
>>>>>> > objects I'm doing in the ByteBuddy-generated bytecode that we
>>>>>> > generate in order to invoke DoFns.
>>>>>> >
>>>>>> > Reuven
>>>>>> >
>>>>>> > On Tue, May 22, 2018 at 10:04 PM Romain Manni-Bucau
>>>>>> > <rmannibucau@gmail.com> wrote:
>>>>>> >
>>>>>> >     Hmm, the pluggability part is close to what I wanted to do
>>>>>> >     with JsonObject as a main API (to avoid redoing a "row"
>>>>>> >     API and schema API).
>>>>>> >     Row.as(Class<T>) sounds good but then, does it mean we'll
>>>>>> >     get beam-sdk-java-row-jsonobject like modules (I'm not
>>>>>> >     against, just trying to understand here)?
>>>>>> >     If so, how can an IO use as() with the type it expects?
>>>>>> >     Doesn't it lead to having a ton of these modules at the
>>>>>> >     end?
>>>>>> >
>>>>>> > Romain Manni-Bucau
>>>>>> > @rmannibucau <https://twitter.com/rmannibucau> | Blog
>>>>>> > <https://rmannibucau.metawerx.net/> | Old Blog
>>>>>> > <http://rmannibucau.wordpress.com> | Github
>>>>>> > <https://github.com/rmannibucau> | LinkedIn
>>>>>> > <https://www.linkedin.com/in/rmannibucau> | Book
>>>>>> > <
>>>>>> https://www.packtpub.com/application-development/java-ee-8-high-performance
>>>>>> >
>>>>>> >
>>>>>> >
>>>>>> > On Wed, May 23, 2018 at 04:57, Reuven Lax <relax@google.com>
>>>>>> > wrote:
>>>>>> >
>>>>>> >     By the way Romain, if you have specific scenarios in mind
>>>>>> >     I would love to hear them. I can try and guess what
>>>>>> >     exactly you would like to get out of schemas, but it would
>>>>>> >     work better if you gave me concrete scenarios that you
>>>>>> >     would like to work.
>>>>>> >
>>>>>> >     Reuven
>>>>>> >
>>>>>> >     On Tue, May 22, 2018 at 7:45 PM Reuven Lax
>>>>>> >     <relax@google.com> wrote:
>>>>>> >
>>>>>> >         Yeah, what I'm working on will help with IO. Basically
>>>>>> >         if you register a function with SchemaRegistry that
>>>>>> >         converts back and forth between a type (say
>>>>>> >         JsonObject) and a Beam Row, then it is applied by the
>>>>>> >         framework behind the scenes as part of DoFn
>>>>>> >         invocation. Concrete example: let's say I have an IO
>>>>>> >         that reads json objects
>>>>>> >
>>>>>> >         class MyJsonIORead extends PTransform<PBegin,
>>>>>> >         JsonObject> {...}
>>>>>> >
>>>>>> >         If you register a schema for this type (or you can
>>>>>> >         also just set the schema directly on the output
>>>>>> >         PCollection), then Beam knows how to convert back and
>>>>>> >         forth between JsonObject and Row. So the next ParDo
>>>>>> >         can look like
>>>>>> >
>>>>>> >         p.apply(new MyJsonIORead())
>>>>>> >          .apply(ParDo.of(new DoFn<JsonObject, T>....
>>>>>> >            @ProcessElement void process(@Element Row row) {
>>>>>> >         })
>>>>>> >
>>>>>> >         And Beam will automatically convert JsonObject to a
>>>>>> >         Row for processing (you aren't forced to do this of
>>>>>> >         course - you can always ask for it as a JsonObject).
>>>>>> >
>>>>>> >         The same is true for output. If you have a sink that
>>>>>> >         takes in JsonObject but the transform before it
>>>>>> >         produces Row objects (for instance - because the
>>>>>> >         transform before it is Beam SQL), Beam can
>>>>>> >         automatically convert Row back to JsonObject for you.
>>>>>> >
>>>>>> >         All of this was detailed in the Schema doc I shared a
>>>>>> >         few months ago. There was a lot of discussion on that
>>>>>> >         document from various parties, and some of this API is
>>>>>> >         a result of that discussion. This is also working in
>>>>>> >         the branch JB and I were working on, though not yet
>>>>>> >         integrated back to master.
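The behind-the-scenes conversion described in that message can be sketched in plain Java. Everything here is an illustrative assumption (a `Map` stands in for `Row`, and `ConvertingInvoker` for the generated DoFn invoker): if a to-Row function is registered for the element's type, the "framework" converts before calling user code that asked for a Row.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Consumer;
import java.util.function.Function;

// Sketch of the behind-the-scenes conversion (all names are illustrative;
// a Map stands in for Beam's Row). If a to-Row function is registered for
// the element's type, the "framework" converts before calling user code
// that asked for a Row.
final class ConvertingInvoker {
    private final Map<Class<?>, Function<Object, Map<String, Object>>> toRow =
            new HashMap<>();

    void register(Class<?> type, Function<Object, Map<String, Object>> fn) {
        toRow.put(type, fn);
    }

    // Simplified DoFn invocation: the user's @ProcessElement asked for a Row.
    void invoke(Object element, Consumer<Map<String, Object>> processElement) {
        Function<Object, Map<String, Object>> fn = toRow.get(element.getClass());
        if (fn == null) {
            throw new IllegalStateException(
                    "no schema registered for " + element.getClass());
        }
        processElement.accept(fn.apply(element)); // automatic conversion
    }
}
```

The reverse direction (Row back to the sink's type, as in the Beam SQL example above) is the same lookup with the function arrow flipped.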
>>>>>> >
>>>>>> >         I would like to actually go further and make Row an
>>>>>> >         interface, and provide a way to automatically put a
>>>>>> >         Row interface on top of any other object (e.g.
>>>>>> >         JsonObject, Pojo, etc.). This won't change the way the
>>>>>> >         user writes code, but instead of Beam having to copy
>>>>>> >         and convert at each stage (e.g. from JsonObject to
>>>>>> >         Row) it simply will create a Row object that uses the
>>>>>> >         JsonObject as its underlying storage.
>>>>>> >
>>>>>> > Reuven
Re: Beam SQL Improvements
Posted by Lukasz Cwik <lc...@google.com>.
Shouldn't the runner isolate each instance of the pipeline behind an
appropriate class loader?
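The static-variable interference Reuven describes in the quoted message below can be reproduced without Beam at all. The sketch below uses invented names (`PipelineScopedState`, `bumpFor`) purely to illustrate the two patterns under discussion: a bare static shared by every pipeline running in the JVM, versus worker-lifetime state keyed per pipeline with an explicit release hook.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

public class PipelineScopedState {
    // The problematic pattern: one counter shared by every pipeline in the JVM.
    static final AtomicInteger SHARED = new AtomicInteger();

    // The safer pattern: worker-lifetime state keyed by a pipeline id,
    // released explicitly when that pipeline finishes.
    static final Map<String, AtomicInteger> PER_PIPELINE = new ConcurrentHashMap<>();

    static int bumpShared() {
        return SHARED.incrementAndGet();
    }

    static int bumpFor(String pipelineId) {
        return PER_PIPELINE.computeIfAbsent(pipelineId, k -> new AtomicInteger())
                           .incrementAndGet();
    }

    static void release(String pipelineId) {
        PER_PIPELINE.remove(pipelineId); // the "close hook" bare statics never get
    }

    public static void main(String[] args) {
        // Two pipelines in one JVM: the static counter sees both runs...
        bumpShared();                  // pipeline A
        int shared = bumpShared();     // pipeline B observes A's state -> 2
        // ...while the keyed state keeps them isolated.
        bumpFor("A");
        int b = bumpFor("B");          // -> 1, unaffected by A
        System.out.println(shared + " " + b); // prints "2 1"
    }
}
```

A class-loader per pipeline, as asked above, would achieve the same isolation for the static field, at the cost of loading each DoFn's classes once per pipeline.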
On Sun, Jun 3, 2018 at 12:45 PM Reuven Lax <re...@google.com> wrote:
> Just an update: Romain and I chatted on Slack, and I think I understand
> his concern. The concern wasn't specifically about schemas, rather about
> having a generic way to register per-ParDo state that has worker lifetime.
> As evidence that such is needed, in many cases static variables are used to
> simulate that. Static variables, however, have downsides: if two pipelines
> are run on the same JVM (happens often with unit tests, and there's nothing
> that prevents a runner from doing so in a production environment), these
> static variables will interfere with each other.
>
> On Thu, May 24, 2018 at 12:30 AM Reuven Lax <re...@google.com> wrote:
>
>> Romain, maybe it would be useful for us to find some time on slack. I'd
>> like to understand your concerns. Also keep in mind that I'm tagging all
>> these classes as Experimental for now, so we can definitely change these
>> interfaces around if we decide they are not the best ones.
>>
>> Reuven
>>
>> On Tue, May 22, 2018 at 11:35 PM Romain Manni-Bucau <
>> rmannibucau@gmail.com> wrote:
>>
>>> Why not extend ProcessContext to add the new remapped output? But it
>>> looks good (the part I don't like is that creating a new context each time
>>> a new feature is added hurts users. What about when Beam adds some
>>> reactive support? ReactiveOutputReceiver?)
>>>
>>> Pipeline sounds like the wrong storage: once distributed you serialize
>>> the instances, so you kind of break the lifecycle of the original instance
>>> and have no real release/close hook on them anymore, right? Not sure we
>>> can do better than DoFn/source embedded instances today.
>>>
>>>
>>>
>>>
>>> Le mer. 23 mai 2018 08:02, Romain Manni-Bucau <rm...@gmail.com> a
>>> écrit :
>>>
>>>>
>>>>
>>>> Le mer. 23 mai 2018 07:55, Jean-Baptiste Onofré <jb...@nanthrax.net> a
>>>> écrit :
>>>>
>>>>> Hi,
>>>>>
>>>>> IMHO, it would be better to have an explicit transform/IO as converter.
>>>>>
>>>>> It would be easier for users.
>>>>>
>>>>> Another option would be to use a "TypeConverter/SchemaConverter" map as
>>>>> we do in Camel: Beam could check the source/destination "type" and check
>>>>> in the map if there's a converter available. This map can be stored as
>>>>> part of the pipeline (as we do for filesystem registration).
>>>>>
>>>>
>>>>
>>>> It works in Camel because it is not strongly typed, isn't it? So it can
>>>> require a new Beam pipeline API.
>>>>
>>>> +1 for the explicit transform; if added to the pipeline API as a coder it
>>>> wouldn't break the fluent API:
>>>>
>>>> p.apply(io).setOutputType(Foo.class)
>>>>
>>>> Coders can be a workaround since they own the type, but since the
>>>> PCollection is the real owner it is surely saner this way, no?
>>>>
>>>> Also it needs to ensure all converters are present before running the
>>>> pipeline; no implicit environment converter support is probably good to
>>>> start with, to avoid late surprises.
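JB's Camel-style converter map and the "check before running" point above can be sketched together as a plain registry keyed by a (source, destination) class pair. All names here (`ConverterRegistry`, `canConvert`) are invented for illustration; this is not a Beam or Camel API.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;
import java.util.function.Function;

// A minimal type-converter registry: look up a converter by
// (source class, destination class), as Camel's type-converter map does.
public class ConverterRegistry {
    private final Map<String, Function<Object, Object>> converters = new HashMap<>();

    private static String key(Class<?> from, Class<?> to) {
        return from.getName() + "->" + to.getName();
    }

    public <A, B> void register(Class<A> from, Class<B> to, Function<A, B> fn) {
        @SuppressWarnings("unchecked")
        Function<Object, Object> erased = (Function<Object, Object>) fn;
        converters.put(key(from, to), erased);
    }

    // Checking availability up front lets a pipeline fail at construction
    // time instead of surfacing a missing converter at runtime.
    public boolean canConvert(Class<?> from, Class<?> to) {
        return converters.containsKey(key(from, to));
    }

    @SuppressWarnings("unchecked")
    public <A, B> B convert(A value, Class<B> to) {
        Function<Object, Object> fn = converters.get(key(value.getClass(), to));
        Objects.requireNonNull(fn, "no converter registered");
        return (B) fn.apply(value);
    }
}
```

A pipeline builder could call `canConvert` for every edge at construction time, matching the "no late surprises" requirement above.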
>>>>
>>>>
>>>>
>>>>> My $0.01
>>>>>
>>>>> Regards
>>>>> JB
>>>>>
>>>>> On 23/05/2018 07:51, Romain Manni-Bucau wrote:
>>>>> > How does it work on the pipeline side?
>>>>> > Do you generate these "virtual" IOs at build time to enable the fluent
>>>>> > API to work without erasing generics?
>>>>> >
>>>>> > ex: SQL(row)->BigQuery(native) will not compile so we need a
>>>>> > SQL(row)->BigQuery(row)
>>>>> >
>>>>> > Side note unrelated to Row: if you add another registry, maybe a
>>>>> > pretask is to ensure Beam has a kind of singleton/context to avoid
>>>>> > duplicating it or not tracking it properly. These kinds of converters
>>>>> > will generally need a global close, not only a per-record one:
>>>>> > converter.init(); converter.convert(row); ...; converter.destroy();,
>>>>> > otherwise it easily leaks. This is why it can require some way to not
>>>>> > recreate it. A quick fix, if you are in ByteBuddy already, can be to
>>>>> > add it to setup/teardown probably; being more global would be nicer
>>>>> > but is more challenging.
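The lifecycle concern above (init once, convert per record, destroy once, so converters don't leak) can be sketched as a wrapper that brackets a whole batch, much like a DoFn's @Setup/@Teardown. The `ManagedConverter` interface is invented for illustration, not a Beam API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// A converter with an explicit worker-lifetime lifecycle: init() once,
// convert (apply) per record, destroy() once.
interface ManagedConverter<A, B> extends Function<A, B> {
    default void init() {}
    default void destroy() {}
}

public class ConverterLifecycle {
    // Runs a batch through the converter with the lifecycle applied once
    // around the whole batch, not once per record.
    public static <A, B> List<B> convertAll(ManagedConverter<A, B> c, List<A> in) {
        c.init();
        try {
            List<B> out = new ArrayList<>();
            for (A a : in) {
                out.add(c.apply(a));
            }
            return out;
        } finally {
            c.destroy(); // the global close hook that per-record code never gets
        }
    }
}
```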
>>>>> >
>>>>> > Romain Manni-Bucau
>>>>> > @rmannibucau <https://twitter.com/rmannibucau> | Blog
>>>>> > <https://rmannibucau.metawerx.net/> | Old Blog
>>>>> > <http://rmannibucau.wordpress.com> | Github
>>>>> > <https://github.com/rmannibucau> | LinkedIn
>>>>> > <https://www.linkedin.com/in/rmannibucau> | Book
>>>>> > <
>>>>> https://www.packtpub.com/application-development/java-ee-8-high-performance
>>>>> >
>>>>> >
>>>>> >
>>>>> > Le mer. 23 mai 2018 à 07:22, Reuven Lax <relax@google.com
>>>>> > <ma...@google.com>> a écrit :
>>>>> >
>>>>> > No - the only modules we need to add to core are the ones we choose
>>>>> > to add. For example, I will probably add a registration for
>>>>> > TableRow/TableSchema (GCP BigQuery) so these can work seamlessly
>>>>> > with schemas. However I will add that to the GCP module, so only
>>>>> > someone depending on that module needs to pull in that dependency.
>>>>> > The Java ServiceLoader framework can be used by these modules to
>>>>> > register schemas for their types (we already do something similar
>>>>> > for FileSystem and for coders as well).
>>>>> >
>>>>> > BTW, right now I'm doing the conversion back and forth between Row
>>>>> > objects in the ByteBuddy-generated bytecode that we generate in
>>>>> > order to invoke DoFns.
>>>>> >
>>>>> > Reuven
>>>>> >
>>>>> > On Tue, May 22, 2018 at 10:04 PM Romain Manni-Bucau
>>>>> > <rmannibucau@gmail.com <ma...@gmail.com>> wrote:
>>>>> >
>>>>> > Hmm, the pluggability part is close to what I wanted to do with
>>>>> > JsonObject as a main API (to avoid redoing a "row" API and a
>>>>> > schema API).
>>>>> > Row.as(Class<T>) sounds good, but then does it mean we'll get
>>>>> > beam-sdk-java-row-jsonobject-like modules (I'm not against, just
>>>>> > trying to understand here)?
>>>>> > If so, how can an IO use as() with the type it expects? Doesn't
>>>>> > it lead to having tons of these modules in the end?
>>>>> >
>>>>> > Romain Manni-Bucau
>>>>> > @rmannibucau <https://twitter.com/rmannibucau> | Blog
>>>>> > <https://rmannibucau.metawerx.net/> | Old Blog
>>>>> > <http://rmannibucau.wordpress.com> | Github
>>>>> > <https://github.com/rmannibucau> | LinkedIn
>>>>> > <https://www.linkedin.com/in/rmannibucau> | Book
>>>>> > <https://www.packtpub.com/application-development/java-ee-8-high-performance>
>>>>> >
>>>>> > Le mer. 23 mai 2018 à 04:57, Reuven Lax <relax@google.com
>>>>> > <ma...@google.com>> a écrit :
>>>>> >
>>>>> > By the way Romain, if you have specific scenarios in mind I would
>>>>> > love to hear them. I can try and guess what exactly you would like
>>>>> > to get out of schemas, but it would work better if you gave me
>>>>> > concrete scenarios that you would like to work.
>>>>> >
>>>>> > Reuven
>>>>> >
>>>>> > On Tue, May 22, 2018 at 7:45 PM Reuven Lax <relax@google.com
>>>>> > <ma...@google.com>> wrote:
>>>>> >
>>>>> > Yeah, what I'm working on will help with IO. Basically if you
>>>>> > register a function with SchemaRegistry that converts back and
>>>>> > forth between a type (say JsonObject) and a Beam Row, then it is
>>>>> > applied by the framework behind the scenes as part of DoFn
>>>>> > invocation. Concrete example: let's say I have an IO that reads
>>>>> > json objects
>>>>> > class MyJsonIORead extends PTransform<PBegin, JsonObject> {...}
>>>>> >
>>>>> > If you register a schema for this type (or you can also just set
>>>>> > the schema directly on the output PCollection), then Beam knows
>>>>> > how to convert back and forth between JsonObject and Row. So the
>>>>> > next ParDo can look like
>>>>> >
>>>>> > p.apply(new MyJsonIORead())
>>>>> >  .apply(ParDo.of(new DoFn<JsonObject, T>....
>>>>> >    @ProcessElement void process(@Element Row row) {
>>>>> > })
>>>>> >
>>>>> > And Beam will automatically convert JsonObject to a Row for
>>>>> > processing (you aren't forced to do this of course - you can
>>>>> > always ask for it as a JsonObject).
>>>>> >
>>>>> > The same is true for output. If you have a sink that takes in
>>>>> > JsonObject but the transform before it produces Row objects (for
>>>>> > instance - because the transform before it is Beam SQL), Beam can
>>>>> > automatically convert Row back to JsonObject for you.
>>>>> >
>>>>> > All of this was detailed in the Schema doc I shared a few months
>>>>> > ago. There was a lot of discussion on that document from various
>>>>> > parties, and some of this API is a result of that discussion. This
>>>>> > is also working in the branch JB and I were working on, though not
>>>>> > yet integrated back to master.
>>>>> >
>>>>> > I would like to actually go further and make Row an interface and
>>>>> > provide a way to automatically put a Row interface on top of any
>>>>> > other object (e.g. JsonObject, Pojo, etc.). This won't change the
>>>>> > way the user writes code, but instead of Beam having to copy and
>>>>> > convert at each stage (e.g. from JsonObject to Row) it simply will
>>>>> > create a Row object that uses the JsonObject as its underlying
>>>>> > storage.
>>>>> >
>>>>> > Reuven
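The zero-copy idea in the message above can be sketched with a plain Map standing in for the wrapped object (e.g. a parsed JSON document). `RowView` and `MapBackedRow` are illustrative names, not the actual Beam Row API; the point is that the view reads through to the underlying storage instead of copying it.

```java
import java.util.Map;

// A minimal Row-like read interface.
interface RowView {
    Object getValue(String field);
}

public class MapBackedRow implements RowView {
    private final Map<String, Object> storage; // e.g. a parsed JSON object

    private MapBackedRow(Map<String, Object> storage) {
        this.storage = storage;
    }

    // Wraps the underlying object without copying its fields.
    public static MapBackedRow of(Map<String, Object> storage) {
        return new MapBackedRow(storage);
    }

    @Override
    public Object getValue(String field) {
        return storage.get(field); // read-through: no per-stage conversion step
    }
}
```

Because nothing is copied, a later change to the wrapped object is visible through the view, which is exactly what distinguishes this from a convert-and-copy stage.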
>>>>> >
>>>>> > On Tue, May 22, 2018 at 11:37 AM Romain Manni-Bucau
>>>>> > <rmannibucau@gmail.com <ma...@gmail.com>> wrote:
>>>>> >
>>>>> > Well, beam can implement a new mapper but it doesn't help for IO.
>>>>> > Most modern backends will take JSON directly, even the javax one,
>>>>> > and it must stay generic.
>>>>> >
>>>>> > Then, since JSON-to-POJO mapping is already done a dozen times,
>>>>> > not sure it is worth it for now.
>>>>> >
>>>>> > Le mar. 22 mai 2018 20:27, Reuven Lax <relax@google.com
>>>>> > <ma...@google.com>> a écrit :
>>>>> >
>>>>> > We can do even better btw. Building a SchemaRegistry where
>>>>> > automatic conversions can be registered between schema and Java
>>>>> > data types. With this the user won't even need a DoFn to do the
>>>>> > conversion.
>>>>> >
>>>>> > On Tue, May 22, 2018, 10:13 AM Romain Manni-Bucau
>>>>> > <rmannibucau@gmail.com <ma...@gmail.com>> wrote:
>>>>> >
>>>>> > Hi guys,
>>>>> >
>>>>> > Checked out what has been done on the schema model and think it is
>>>>> > acceptable - regarding the json debate - if
>>>>> > https://issues.apache.org/jira/browse/BEAM-4381 can be fixed.
>>>>> >
>>>>> > High level, it is about providing a mainstream and not too
>>>>> > impacting model OOTB, and JSON seems the most valid option for
>>>>> > now, at least for IO and some user transforms.
>>>>> >
>>>>> > Wdyt?
>>>>> >
>>>>> > Le ven. 27 avr. 2018 18:36, Romain Manni-Bucau
>>>>> > <rmannibucau@gmail.com <ma...@gmail.com>> a écrit :
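The SchemaRegistry idea quoted above differs from a pairwise converter map: each type registers one to-Row and one from-Row function, and any-to-any conversion then goes through Row, so N types need N registrations rather than N² converters. A hypothetical sketch, with a simple field map standing in for Beam's Row:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

public class SchemaRegistrySketch {
    // Stand-in for Beam's Row: just a bag of named fields.
    public static final class Row {
        public final Map<String, Object> fields;
        public Row(Map<String, Object> fields) { this.fields = fields; }
    }

    private final Map<Class<?>, Function<Object, Row>> toRow = new HashMap<>();
    private final Map<Class<?>, Function<Row, Object>> fromRow = new HashMap<>();

    @SuppressWarnings("unchecked")
    public <T> void register(Class<T> type,
                             Function<T, Row> to,
                             Function<Row, T> from) {
        toRow.put(type, (Function<Object, Row>) to);
        fromRow.put(type, (Function<Row, Object>) from);
    }

    // Any registered type converts to any other via the Row hub:
    // one pair of functions per type instead of one converter per type pair.
    @SuppressWarnings("unchecked")
    public <A, B> B convert(A value, Class<B> target) {
        Row row = toRow.get(value.getClass()).apply(value);
        return (B) fromRow.get(target).apply(row);
    }
}
```

With such a registry a framework can apply the conversion behind the scenes, which is what makes the user-written DoFn unnecessary in the quoted proposal.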
>>>>> >
>>>>> > Can give it a try end of May, sure (holidays and work constraints
>>>>> > will make it hard before).
>>>>> >
>>>>> > Le 27 avr. 2018 18:26, "Anton Kedin" <kedin@google.com
>>>>> > <ma...@google.com>> a écrit :
>>>>> >
>>>>> > Romain,
>>>>> >
>>>>> > I don't believe that the JSON approach was investigated very
>>>>> > thoroughly. I mentioned a few reasons which make it not the best
>>>>> > choice in my opinion, but I may be wrong. Can you put together a
>>>>> > design doc or a prototype?
>>>>> >
>>>>> > Thank you,
>>>>> > Anton
>>>>> >
>>>>> > On Thu, Apr 26, 2018 at 10:17 PM Romain Manni-Bucau
>>>>> > <rmannibucau@gmail.com <ma...@gmail.com>> wrote:
>>>>> >
>>>>> > Le 26 avr. 2018 23:13, "Anton Kedin" <kedin@google.com
>>>>> > <ma...@google.com>> a écrit :
>>>>> >
>>>>> > BeamRecord (Row) has very little in common with JsonObject (I
>>>>> > assume you're talking about javax.json), except maybe some
>>>>> > similarities of the API. A few reasons why JsonObject doesn't
>>>>> > work:
>>>>> >
>>>>> > * it is a Java EE API:
>>>>> >   o Beam SDK is not limited to Java. There are probably similar
>>>>> >     APIs for other languages but they might not necessarily carry
>>>>> >     the same semantics / APIs;
>>>>> >
>>>>> > Not a big deal I think. At least not a technical blocker.
>>>>> >
>>>>> >   o It can change between Java versions;
>>>>> >
>>>>> > No, this is javaee ;).
>>>>> >
>>>>> >   o The current Beam Java implementation is an experimental
>>>>> >     feature to identify what's needed from such an API; in the end
>>>>> >     we might end up with something similar to the JsonObject API,
>>>>> >     but likely not;
>>>>> >
>>>>> > I don't get that point as a blocker.
>>>>> >
>>>>> > * represents JSON, which is not an API but an object notation:
>>>>> >   o it is defined as a unicode string in a certain format. If you
>>>>> >     choose to adhere to ECMA-404, then it doesn't sound like
>>>>> >     JsonObject can represent an Avro object, if I'm reading it
>>>>> >     right;
>>>>> >
>>>>> > It is in the generator impl, you can impl an avro generator.
>>>>> >
>>>>> > * doesn't define a type system (JSON does, but it's lacking):
>>>>> >   o for example, JSON doesn't define semantics for numbers;
>>>>> >   o doesn't define date/time types;
>>>>> >   o doesn't allow extending the JSON type system at all;
>>>>> >
>>>>> > That is why you need a metadata object, or simpler, a schema with
>>>>> > that data. Json or beam record doesn't help here and you end up
>>>>> > with the same outcome if you think about it.
>>>>> >
>>>>> > * lacks schemas;
>>>>> >
>>>>> > Json schemas are standard, widely spread and tooled compared to
>>>>> > the alternative.
>>>>> >
>>>>> > You can definitely try to loosen the requirements and define
>>>>> > everything in JSON in userland, but the point of Row/Schema is to
>>>>> > avoid it and define everything in the Beam model, which can be
>>>>> > extended, mapped to JSON, Avro, BigQuery schemas, custom binary
>>>>> > formats etc., with the same semantics across Beam SDKs.
>>>>> >
>>>>> > This is what jsonp would allow, with the benefit of natural POJO
>>>>> > support through jsonb.
>>>>> >
>>>>> > On Thu, Apr 26, 2018 at 12:28 PM Romain Manni-Bucau
>>>>> > <rmannibucau@gmail.com <ma...@gmail.com>> wrote:
>>>>> >
>>>>> > Just to let it be clear and let me understand: how is BeamRecord
>>>>> > different from a JsonObject, which is an API without an
>>>>> > implementation (not even a json one OOTB)? Advantages of the json
>>>>> > *api* are indeed natural mapping (jsonb is based on jsonp so no
>>>>> > new binding to reinvent) and simple serialization (json+gzip for
>>>>> > ex, or avro if you want to be geeky).
>>>>> >
>>>>> > I fail to see the point of rebuilding an ecosystem ATM.
>>>>> >
>>>>> > Le 26 avr. 2018 19:12, "Reuven Lax" <relax@google.com
>>>>> > <ma...@google.com>> a écrit :
>>>>> >
>>>>> > Exactly what JB said. We will write a generic conversion from Avro
>>>>> > (or json) to Beam schemas, which will make them work transparently
>>>>> > with SQL. The plan is also to migrate Anton's work so that POJOs
>>>>> > work generically for any schema.
>>>>> >
>>>>> > Reuven
>>>>> >
>>>>> > On Thu, Apr 26, 2018 at 1:17 AM Jean-Baptiste Onofré
>>>>> > <jb@nanthrax.net <ma...@nanthrax.net>> wrote:
>>>>> >
>>>>> > For now we have a generic schema interface. Json-b can be an impl,
>>>>> > avro could be another one.
>>>>> >
>>>>> > Regards
>>>>> > JB
>>>>> > Le 26 avr. 2018, à 12:08, Romain Manni-Bucau
>>>>> > <rmannibucau@gmail.com <ma...@gmail.com>> a écrit:
>>>>> >
>>>>> > Hmm,
>>>>> >
>>>>> > avro still has the pitfall of an uncontrolled stack which brings
>>>>> > way too many dependencies to be part of any API; this is why I
>>>>> > proposed a JSON-P based API (JsonObject) with a custom beam entry
>>>>> > for some metadata (headers "à la Camel").
>>>>> >
>>>>> > Romain Manni-Bucau
>>>>> > @rmannibucau <https://twitter.com/rmannibucau> | Blog
>>>>> > <https://rmannibucau.metawerx.net/> | Old Blog
>>>>> > <http://rmannibucau.wordpress.com> | Github
>>>>> > <https://github.com/rmannibucau> | LinkedIn
>>>>> > <https://www.linkedin.com/in/rmannibucau> | Book
>>>>> > <https://www.packtpub.com/application-development/java-ee-8-high-performance>
>>>>> >
>>>>> >
>>>>> >
>>>>> >
>>>>> > 2018-04-26 9:59 GMT+02:00 Jean-Baptiste Onofré <jb@nanthrax.net
>>>>> > <ma...@nanthrax.net>>:
>>>>> >
>>>>> > Hi Ismael
>>>>> >
>>>>> > You mean directly in Beam SQL ?
>>>>> >
>>>>> > That will be part of schema support: generic record could be one
>>>>> > of the payload with across schema.
>>>>> >
>>>>> > Regards
>>>>> > JB
>>>>> > Le 26 avr. 2018, à 11:39, "Ismaël Mejía" <iemejia@gmail.com
>>>>> > <ma...@gmail.com>> a écrit:
>>>>> >
>>>>> > Hello Anton,
>>>>> >
>>>>> > Thanks for the descriptive email and the really useful work. Any
>>>>> > plans to tackle PCollections of GenericRecord/IndexedRecords? it
>>>>> > seems Avro is a natural fit for this approach too.
>>>>> >
>>>>> > Regards,
>>>>> > Ismaël
>>>>> >
>>>>> > On Wed, Apr 25, 2018 at 9:04 PM, Anton Kedin <kedin@google.com
>>>>> > <ma...@google.com>> wrote:
>>>>> >
>>>>> > Hi,
>>>>> >
>>>>> > I want to highlight a couple of improvements to Beam SQL we have
>>>>> > been working on recently which are targeted to make the Beam SQL
>>>>> > API easier to use. Specifically these features simplify conversion
>>>>> > of Java Beans and JSON strings to Rows.
>>>>> >
>>>>> > Feel free to try this and send any bugs/comments/PRs my way.
>>>>> >
>>>>> > **Caveat: this is still work in progress, and has known bugs and
>>>>> > incomplete features, see below for details.**
>>>>> >
>>>>> > Background
>>>>> >
>>>>> > Beam SQL queries can only be applied to PCollection<Row>. This
>>>>> > means that users need to convert whatever PCollection elements
>>>>> > they have to Rows before querying them with SQL. This usually
>>>>> > requires
>>>>>
>>>>
>>>>
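Anton's quoted announcement above (truncated in the archive) concerns simplifying Java Bean and JSON-to-Row conversion. A reflection-over-getters bean-to-field-map conversion of the kind such a feature needs can be sketched as follows; this is an illustrative approximation, not Beam's actual implementation, and `BeanToRowSketch` is an invented name.

```java
import java.lang.reflect.Method;
import java.util.LinkedHashMap;
import java.util.Map;

// Turns a Java Bean into a flat field map (a stand-in for a Beam Row)
// by reading its public zero-argument getters reflectively.
public class BeanToRowSketch {
    public static Map<String, Object> toRow(Object bean) {
        Map<String, Object> row = new LinkedHashMap<>();
        for (Method m : bean.getClass().getMethods()) {
            String name = m.getName();
            if (name.length() > 3 && name.startsWith("get")
                    && !name.equals("getClass") && m.getParameterCount() == 0) {
                try {
                    // "getUserName" -> field "userName"
                    String field = Character.toLowerCase(name.charAt(3))
                            + name.substring(4);
                    row.put(field, m.invoke(bean));
                } catch (ReflectiveOperationException e) {
                    throw new IllegalStateException("unreadable getter " + name, e);
                }
            }
        }
        return row;
    }
}
```

Any public bean works as input; for example `java.awt.Point` yields `x` and `y` fields (plus any other zero-argument getters the class exposes).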