Posted to dev@beam.apache.org by Romain Manni-Bucau <rm...@gmail.com> on 2018/05/22 17:13:19 UTC

Re: Beam SQL Improvements

Hi guys,

Checked out what has been done on the schema model and I think it is acceptable -
regarding the JSON debate - if
https://issues.apache.org/jira/browse/BEAM-4381 can be fixed.

At a high level, it is about providing a mainstream and not too impactful model
OOTB, and JSON seems the most valid option for now, at least for IO and some
user transforms.

Wdyt?
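
For reference, a minimal sketch of the javax.json (JSON-P) API being referred to
here; the field names are purely illustrative:

    // javax.json.Json / javax.json.JsonObject (JSON-P)
    // Build a generic, record-like value with the standard builder API.
    JsonObject record = Json.createObjectBuilder()
        .add("type", "foo")
        .add("size", 333)
        .build();

    // Read values back through the same generic API.
    String type = record.getString("type");
    int size = record.getInt("size");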

On Fri, Apr 27, 2018 at 18:36, Romain Manni-Bucau <rm...@gmail.com>
wrote:

>  Can give it a try at the end of May, sure (holidays and work constraints will
> make it hard before then).
>
> On Apr 27, 2018 at 18:26, "Anton Kedin" <ke...@google.com> wrote:
>
>> Romain,
>>
>> I don't believe that the JSON approach was investigated very thoroughly. I
>> mentioned a few reasons which make it not the best choice in my opinion,
>> but I may be wrong. Can you put together a design doc or a prototype?
>>
>> Thank you,
>> Anton
>>
>>
>> On Thu, Apr 26, 2018 at 10:17 PM Romain Manni-Bucau <
>> rmannibucau@gmail.com> wrote:
>>
>>>
>>>
>>> On Apr 26, 2018 at 23:13, "Anton Kedin" <ke...@google.com> wrote:
>>>
>>> BeamRecord (Row) has very little in common with JsonObject (I assume
>>> you're talking about javax.json), except maybe some similarities of the
>>> API. A few reasons why JsonObject doesn't work:
>>>
>>>    - it is a Java EE API:
>>>       - Beam SDK is not limited to Java. There are probably similar
>>>       APIs for other languages but they might not necessarily carry the same
>>>       semantics / APIs;
>>>
>>>
>>> Not a big deal I think. At least not a technical blocker.
>>>
>>>
>>>    - It can change between Java versions;
>>>
>>> No, this is Java EE ;).
>>>
>>>
>>>
>>>    - Current Beam Java implementation is an experimental feature to
>>>       identify what's needed from such an API; in the end we might end up with
>>>       something similar to the JsonObject API, but likely not;
>>>
>>>
>>> I don't get that point as a blocker.
>>>
>>>
>>>    - represents JSON, which is not an API but an object notation:
>>>       - it is defined as a unicode string in a certain format. If you
>>>       choose to adhere to ECMA-404, then it doesn't sound like JsonObject can
>>>       represent an Avro object, if I'm reading it right;
>>>
>>>
>>> That is in the generator impl; you can implement an Avro generator.
>>>
>>>
>>>    - doesn't define a type system (JSON does, but it's lacking):
>>>       - for example, JSON doesn't define semantics for numbers;
>>>       - doesn't define date/time types;
>>>       - doesn't allow extending JSON type system at all;
>>>
>>>
>>> That is why you need a metadata object or, simpler, a schema with that
>>> data. JSON or a Beam record doesn't help here and you end up with the same
>>> outcome if you think about it.
>>>
>>>
>>>    - lacks schemas;
>>>
>>> JSON Schema is a standard, widely adopted and well tooled compared to the
>>> alternatives.
>>>
>>> You can definitely try to loosen the requirements and define everything in
>>> JSON in userland, but the point of Row/Schema is to avoid that and define
>>> everything in the Beam model, which can be extended, mapped to JSON, Avro,
>>> BigQuery schemas, custom binary formats etc., with the same semantics across
>>> Beam SDKs.
>>>
>>>
>>> This is what JSON-P would allow, with the benefit of natural POJO
>>> support through JSON-B.
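
As an illustration of the JSON-B point above, a minimal sketch (MyPojo is a
hypothetical bean with type/size properties):

    // javax.json.bind.Jsonb / javax.json.bind.JsonbBuilder (JSON-B)
    Jsonb jsonb = JsonbBuilder.create();

    // POJO -> JSON string and back, with no hand-written binding code.
    String json = jsonb.toJson(new MyPojo("foo", 333));
    MyPojo pojo = jsonb.fromJson("{\"type\":\"foo\",\"size\":333}", MyPojo.class);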
>>>
>>>
>>>
>>> On Thu, Apr 26, 2018 at 12:28 PM Romain Manni-Bucau <
>>> rmannibucau@gmail.com> wrote:
>>>
>>>> Just to let it be clear and let me understand: how is BeamRecord
>>>> different from a JsonObject, which is an API without implementation (not
>>>> even a JSON one OOTB)? Advantages of the JSON *API* are indeed natural mapping
>>>> (JSON-B is based on JSON-P so there is no new binding to reinvent) and simple
>>>> serialization (JSON+gzip for example, or Avro if you want to be geeky).
>>>>
>>>> I fail to see the point of rebuilding an ecosystem ATM.
>>>>
>>>> On Apr 26, 2018 at 19:12, "Reuven Lax" <re...@google.com> wrote:
>>>>
>>>>> Exactly what JB said. We will write a generic conversion from Avro (or
>>>>> JSON) to Beam schemas, which will make them work transparently with SQL.
>>>>> The plan is also to migrate Anton's work so that POJOs work generically
>>>>> for any schema.
>>>>>
>>>>> Reuven
>>>>>
>>>>> On Thu, Apr 26, 2018 at 1:17 AM Jean-Baptiste Onofré <jb...@nanthrax.net>
>>>>> wrote:
>>>>>
>>>>>> For now we have a generic schema interface. JSON-B can be one impl,
>>>>>> Avro could be another one.
>>>>>>
>>>>>> Regards
>>>>>> JB
>>>>>> On Apr 26, 2018, at 12:08, Romain Manni-Bucau <rm...@gmail.com>
>>>>>> wrote:
>>>>>>>
>>>>>>> Hmm,
>>>>>>>
>>>>>>> Avro still has the pitfall of an uncontrolled stack which
>>>>>>> brings way too many dependencies to be part of any API;
>>>>>>> this is why I proposed a JSON-P based API (JsonObject) with a custom
>>>>>>> Beam entry for some metadata (headers "à la Camel").
>>>>>>>
>>>>>>>
>>>>>>> Romain Manni-Bucau
>>>>>>> @rmannibucau <https://twitter.com/rmannibucau> |   Blog
>>>>>>> <https://rmannibucau.metawerx.net/> | Old Blog
>>>>>>> <http://rmannibucau.wordpress.com> |  Github
>>>>>>> <https://github.com/rmannibucau> | LinkedIn
>>>>>>> <https://www.linkedin.com/in/rmannibucau> | Book
>>>>>>> <https://www.packtpub.com/application-development/java-ee-8-high-performance>
>>>>>>>
>>>>>>> 2018-04-26 9:59 GMT+02:00 Jean-Baptiste Onofré <jb...@nanthrax.net>:
>>>>>>>
>>>>>>>> Hi Ismael
>>>>>>>>
>>>>>>>> You mean directly in Beam SQL ?
>>>>>>>>
>>>>>>>> That will be part of the schema support: generic records could be one of
>>>>>>>> the payloads supported across schemas.
>>>>>>>>
>>>>>>>> Regards
>>>>>>>> JB
>>>>>>>> On Apr 26, 2018, at 11:39, "Ismaël Mejía" <iemejia@gmail.com>
>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> Hello Anton,
>>>>>>>>>
>>>>>>>>> Thanks for the descriptive email and the really useful work. Any plans
>>>>>>>>> to tackle PCollections of GenericRecord/IndexedRecords? It seems Avro
>>>>>>>>> is a natural fit for this approach too.
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> Ismaël
>>>>>>>>>
>>>>>>>>> On Wed, Apr 25, 2018 at 9:04 PM, Anton Kedin <ke...@google.com> wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>>            Hi,
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>  I want to highlight a couple of improvements to Beam SQL we have been
>>>>>>>>>>
>>>>>>>>>>  working on recently, which are targeted at making the Beam SQL API easier to use.
>>>>>>>>>>
>>>>>>>>>>  Specifically, these features simplify conversion of Java Beans and JSON
>>>>>>>>>>
>>>>>>>>>>  strings to Rows.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>  Feel free to try this and send any bugs/comments/PRs my way.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>  **Caveat: this is still work in progress, and has known bugs and incomplete
>>>>>>>>>>
>>>>>>>>>>  features, see below for details.**
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>  Background
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>  Beam SQL queries can only be applied to PCollection<Row>. This means that
>>>>>>>>>>
>>>>>>>>>>  users need to convert whatever PCollection elements they have to Rows before
>>>>>>>>>>
>>>>>>>>>>  querying them with SQL. This usually requires manually creating a Schema and
>>>>>>>>>>
>>>>>>>>>>  implementing a custom conversion PTransform<PCollection<Element>,
>>>>>>>>>>
>>>>>>>>>>  PCollection<Row>> (see Beam SQL Guide).
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>  The improvements described here are an attempt to reduce this overhead for
>>>>>>>>>>
>>>>>>>>>>  a few common cases, as a start.
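
For reference, the kind of manual conversion being replaced might look like the
following sketch (Element and its getName() accessor are hypothetical, and the
PCollection<Element> is assumed to come from some IO):

    // A hypothetical element type with a single field.
    class Element implements Serializable {
      String name;
      String getName() { return name; }
    }

    // Manually define a Schema mirroring the element, then convert in a custom DoFn.
    Schema schema = Schema.builder().addStringField("name").build();

    PCollection<Element> elements = ...;

    PCollection<Row> rows = elements.apply(
        ParDo.of(new DoFn<Element, Row>() {
          @ProcessElement
          public void processElement(ProcessContext c) {
            c.output(Row.withSchema(schema).addValues(c.element().getName()).build());
          }
        }));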
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>  Status
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>  Introduced an InferredRowCoder to automatically generate Rows from beans.
>>>>>>>>>>
>>>>>>>>>>  Removes the need to manually define a Schema and Row conversion logic;
>>>>>>>>>>
>>>>>>>>>>  Introduced the JsonToRow transform to automatically parse JSON objects to Rows.
>>>>>>>>>>
>>>>>>>>>>  Removes the need to manually implement conversion logic;
>>>>>>>>>>
>>>>>>>>>>  This is still experimental work in progress, APIs will likely change;
>>>>>>>>>>
>>>>>>>>>>  There are known bugs/unsolved problems;
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>  Java Beans
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>  Introduced a coder which facilitates Row generation from Java Beans.
>>>>>>>>>>
>>>>>>>>>>  It reduces the overhead to:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>             /** Some user-defined Java Bean */
>>>>>>>>>>>
>>>>>>>>>>>  class JavaBeanObject implements Serializable {
>>>>>>>>>>>
>>>>>>>>>>>  String getName() { ... }
>>>>>>>>>>>
>>>>>>>>>>>  }
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>  // Obtain the objects:
>>>>>>>>>>>
>>>>>>>>>>>  PCollection<JavaBeanObject> javaBeans = ...;
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>  // Convert to Rows and apply a SQL query:
>>>>>>>>>>>
>>>>>>>>>>>  PCollection<Row> queryResult =
>>>>>>>>>>>
>>>>>>>>>>>  javaBeans
>>>>>>>>>>>
>>>>>>>>>>>  .setCoder(InferredRowCoder.ofSerializable(JavaBeanObject.class))
>>>>>>>>>>>
>>>>>>>>>>>  .apply(BeamSql.query("SELECT name FROM PCOLLECTION"));
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>  Notice that there is no manual Schema definition or custom conversion
>>>>>>>>>>
>>>>>>>>>>  logic anymore.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>  Links
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>   example;
>>>>>>>>>>
>>>>>>>>>>   InferredRowCoder;
>>>>>>>>>>
>>>>>>>>>>   test;
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>  JSON
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>  Introduced JsonToRow transform. It is possible to query a
>>>>>>>>>>
>>>>>>>>>>  PCollection<String> that contains JSON objects like this:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>             // Assuming JSON objects look like this:
>>>>>>>>>>>
>>>>>>>>>>>  // { "type" : "foo", "size" : 333 }
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>  // Define a Schema:
>>>>>>>>>>>
>>>>>>>>>>>  Schema jsonSchema =
>>>>>>>>>>>
>>>>>>>>>>>  Schema
>>>>>>>>>>>
>>>>>>>>>>>  .builder()
>>>>>>>>>>>
>>>>>>>>>>>  .addStringField("type")
>>>>>>>>>>>
>>>>>>>>>>>  .addInt32Field("size")
>>>>>>>>>>>
>>>>>>>>>>>  .build();
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>  // Obtain PCollection of the objects in JSON format:
>>>>>>>>>>>
>>>>>>>>>>>  PCollection<String> jsonObjects = ...
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>  // Convert to Rows and apply a SQL query:
>>>>>>>>>>>
>>>>>>>>>>>  PCollection<Row> queryResults =
>>>>>>>>>>>
>>>>>>>>>>>  jsonObjects
>>>>>>>>>>>
>>>>>>>>>>>  .apply(JsonToRow.withSchema(jsonSchema))
>>>>>>>>>>>
>>>>>>>>>>>  .apply(BeamSql.query(
>>>>>>>>>>>      "SELECT type, AVG(size) FROM PCOLLECTION GROUP BY type"));
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>  Notice that the JSON to Row conversion is done by the JsonToRow transform. It is
>>>>>>>>>>
>>>>>>>>>>  currently required to supply a Schema.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>  Links
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>   JsonToRow;
>>>>>>>>>>
>>>>>>>>>>   test/example;
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>  Going Forward
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>  fix bugs (BEAM-4163, BEAM-4161 ...)
>>>>>>>>>>
>>>>>>>>>>  implement more features (BEAM-4167, more types of objects);
>>>>>>>>>>
>>>>>>>>>>  wire this up with sources/sinks to further simplify SQL API;
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>  Thank you,
>>>>>>>>>>
>>>>>>>>>>  Anton
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>
>>>

Re: Beam SQL Improvements

Posted by Reuven Lax <re...@google.com>.
Sure - we can definitely add explicit conversion transforms. The automatic
conversion is useful for generic transforms and frameworks (such as SQL)
that want to be able to take in a PCollection and operate on it. However, if
users using Schemas directly find it easier to have explicit transforms to
do the conversion, there's no reason not to add them.
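
For illustration, a minimal sketch of what such an explicit conversion transform
could look like; JsonObjectToRow is a hypothetical name rather than an existing
Beam transform, the "type"/"size" fields come from the JSON example earlier in
the thread, package names are as in the current Java SDK, and coder/schema
registration is left out:

    import javax.json.JsonObject;

    import org.apache.beam.sdk.schemas.Schema;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.transforms.PTransform;
    import org.apache.beam.sdk.transforms.ParDo;
    import org.apache.beam.sdk.values.PCollection;
    import org.apache.beam.sdk.values.Row;

    class JsonObjectToRow extends PTransform<PCollection<JsonObject>, PCollection<Row>> {
      private final Schema schema;

      JsonObjectToRow(Schema schema) {
        this.schema = schema;
      }

      @Override
      public PCollection<Row> expand(PCollection<JsonObject> input) {
        return input.apply(ParDo.of(new DoFn<JsonObject, Row>() {
          @ProcessElement
          public void processElement(ProcessContext c) {
            JsonObject json = c.element();
            // Map the two known fields explicitly; a real converter would walk the schema.
            c.output(Row.withSchema(schema)
                .addValues(json.getString("type"), json.getInt("size"))
                .build());
          }
        }));
      }
    }

    // Usage: p.apply(readJson).apply(new JsonObjectToRow(jsonSchema)).apply(BeamSql.query(...));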

On Tue, May 22, 2018 at 10:55 PM Jean-Baptiste Onofré <jb...@nanthrax.net>
wrote:

> Hi,
>
> IMHO, it would be better to have an explicit transform/IO as a converter.
>
> It would be easier for users.
>
> Another option would be to use a "TypeConverter/SchemaConverter" map as
> we do in Camel: Beam could check the source/destination "type" and check
> in the map if there's a converter available. This map can be stored as
> part of the pipeline (as we do for filesystem registration).
>
> My $0.01
>
> Regards
> JB
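
A rough sketch of that Camel-style converter map idea (ConverterRegistry and its
methods are hypothetical, not an existing Beam API):

    import java.io.Serializable;
    import java.util.Arrays;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    import org.apache.beam.sdk.transforms.SerializableFunction;

    // Registry keyed by (source type, destination type), stored with the pipeline.
    class ConverterRegistry implements Serializable {
      private final Map<List<Class<?>>, SerializableFunction<?, ?>> converters = new HashMap<>();

      <A, B> void register(Class<A> from, Class<B> to, SerializableFunction<A, B> fn) {
        converters.put(Arrays.<Class<?>>asList(from, to), fn);
      }

      @SuppressWarnings("unchecked")
      <A, B> SerializableFunction<A, B> lookup(Class<A> from, Class<B> to) {
        // Beam would consult this map when the source/destination "types" of two
        // chained transforms differ and a conversion is needed.
        return (SerializableFunction<A, B>) converters.get(Arrays.<Class<?>>asList(from, to));
      }
    }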
>
> On 23/05/2018 07:51, Romain Manni-Bucau wrote:
> > How does it work on the pipeline side?
> > Do you generate these "virtual" IOs at build time to enable the fluent
> > API to work without erasing generics?
> >
> > ex: SQL(row)->BigQuery(native) will not compile so we need a
> > SQL(row)->BigQuery(row)
> >
> > Side note unrelated to Row: if you add another registry, maybe a pre-task
> > is to ensure Beam has a kind of singleton/context to avoid duplicating
> > it or not tracking it properly. These kinds of converters generally need a global
> > close and not only a per-record one:
> > converter.init(); converter.convert(row); ...; converter.destroy();,
> > otherwise it easily leaks. This is why it can require some way to not
> > recreate it. A quick fix, if you are in ByteBuddy already, can be to add
> > it to setup/teardown probably; being more global would be nicer but is more
> > challenging.
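
A sketch of the lifecycle concern above using the DoFn @Setup/@Teardown hooks;
Converter and JsonpConverter are hypothetical stand-ins, not Beam or JSON-P classes:

    import javax.json.Json;
    import javax.json.JsonObject;

    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.values.Row;

    // Stand-in for the kind of stateful converter discussed above.
    interface Converter {
      void init();
      JsonObject convert(Row row);
      void destroy();
    }

    class ConvertingFn extends DoFn<Row, JsonObject> {
      private transient Converter converter;

      @Setup
      public void setup() {
        converter = new JsonpConverter();
        converter.init();              // created once per DoFn instance, not once per record
      }

      @ProcessElement
      public void processElement(ProcessContext c) {
        c.output(converter.convert(c.element()));
      }

      @Teardown
      public void teardown() {
        converter.destroy();           // the "global close" mentioned above
      }

      // Minimal placeholder implementation so the sketch is self-contained.
      static class JsonpConverter implements Converter {
        @Override public void init() {}
        @Override public JsonObject convert(Row row) {
          return Json.createObjectBuilder().add("name", String.valueOf(row.getValue(0))).build();
        }
        @Override public void destroy() {}
      }
    }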
> >
> > Romain Manni-Bucau
> > @rmannibucau <https://twitter.com/rmannibucau> |  Blog
> > <https://rmannibucau.metawerx.net/> | Old Blog
> > <http://rmannibucau.wordpress.com> | Github
> > <https://github.com/rmannibucau> | LinkedIn
> > <https://www.linkedin.com/in/rmannibucau> | Book
> > <
> https://www.packtpub.com/application-development/java-ee-8-high-performance
> >
> >
> >
> > On Wed, May 23, 2018 at 07:22, Reuven Lax <relax@google.com
> > <ma...@google.com>> wrote:
> >
> >     No - the only modules we need to add to core are the ones we choose
> >     to add. For example, I will probably add a registration for
> >     TableRow/TableSchema (GCP BigQuery) so these can work seamlessly
> >     with schemas. However, I will add that to the GCP module, so only
> >     someone depending on that module needs to pull in that dependency.
> >     The Java ServiceLoader framework can be used by these modules to
> >     register schemas for their types (we already do something similar
> >     for FileSystem and for coders as well).
> >
> >     BTW, right now I'm doing the conversion back and forth between Row
> >     objects in the ByteBuddy-generated bytecode that we generate in order
> >     to invoke DoFns.
> >
> >     Reuven
> >
> >     On Tue, May 22, 2018 at 10:04 PM Romain Manni-Bucau
> >     <rmannibucau@gmail.com <ma...@gmail.com>> wrote:
> >
> >         Hmm, the pluggability part is close to what I wanted to do with
> >         JsonObject as a main API (to avoid redoing a "row" API and a
> >         schema API).
> >         Row.as(Class<T>) sounds good but then, does it mean we'll get
> >         beam-sdk-java-row-jsonobject like modules (I'm not against, just
> >         trying to understand here)?
> >         If so, how can an IO use as() with the type it expects? Doesn't
> >         it lead to having tons of these modules in the end?
> >
> >         Romain Manni-Bucau
> >         @rmannibucau <https://twitter.com/rmannibucau> |  Blog
> >         <https://rmannibucau.metawerx.net/> | Old Blog
> >         <http://rmannibucau.wordpress.com> | Github
> >         <https://github.com/rmannibucau> | LinkedIn
> >         <https://www.linkedin.com/in/rmannibucau> | Book
> >         <
> https://www.packtpub.com/application-development/java-ee-8-high-performance
> >
> >
> >
> >         On Wed, May 23, 2018 at 04:57, Reuven Lax <relax@google.com
> >         <ma...@google.com>> wrote:
> >
> >             By the way Romain, if you have specific scenarios in mind I
> >             would love to hear them. I can try and guess what exactly
> >             you would like to get out of schemas, but it would work
> >             better if you gave me concrete scenarios that you would like
> >             to see work.
> >
> >             Reuven
> >
> >             On Tue, May 22, 2018 at 7:45 PM Reuven Lax <relax@google.com
> >             <ma...@google.com>> wrote:
> >
> >                 Yeah, what I'm working on will help with IO. Basically
> >                 if you register a function with SchemaRegistry that
> >                 converts back and forth between a type (say JsonObject)
> >                 and a Beam Row, then it is applied by the framework
> >                 behind the scenes as part of DoFn invocation. Concrete
> >                 example: let's say I have an IO that reads json objects
> >                   class MyJsonIORead extends PTransform<PBegin, JsonObject> {...}
> >
> >                 If you register a schema for this type (or you can also
> >                 just set the schema directly on the output PCollection),
> >                 then Beam knows how to convert back and forth between
> >                 JsonObject and Row. So the next ParDo can look like
> >
> >                 p.apply(new MyJsonIORead())
> >                  .apply(ParDo.of(new DoFn<JsonObject, T>() {
> >                      @ProcessElement
> >                      public void process(@Element Row row) { ... }
> >                  }));
> >
> >                 And Beam will automatically convert JsonObject to a Row
> >                 for processing (you aren't forced to do this of course -
> >                 you can always ask for it as a JsonObject).
> >
> >                 The same is true for output. If you have a sink that
> >                 takes in JsonObject but the transform before it produces
> >                 Row objects (for instance - because the transform before
> >                 it is Beam SQL), Beam can automatically convert Row back
> >                 to JsonObject for you.
> >
> >                 All of this was detailed in the Schema doc I shared a
> >                 few months ago. There was a lot of discussion on that
> >                 document from various parties, and some of this API is a
> >                 result of that discussion. This is also working in the
> >                 branch JB and I were working on, though not yet
> >                 integrated back to master.
> >
> >                 I would like to actually go further and make Row an
> >                 interface and provide a way to automatically put a Row
> >                 interface on top of any other object (e.g. JsonObject,
> >                 Pojo, etc.) This won't change the way the user writes
> >                 code, but instead of Beam having to copy and convert at
> >                 each stage (e.g. from JsonObject to Row) it simply will
> >                 create a Row object that uses the JsonObject as its
> >                 underlying storage.
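
A sketch of that "Row as a view over the underlying object" idea; RowView and
JsonObjectRowView are hypothetical names, not part of Beam:

    import javax.json.JsonObject;

    // A read-only, row-like view backed directly by the JsonObject: no copy, no conversion.
    interface RowView {
      Object getValue(String fieldName);
    }

    class JsonObjectRowView implements RowView {
      private final JsonObject json;

      JsonObjectRowView(JsonObject json) {
        this.json = json;
      }

      @Override
      public Object getValue(String fieldName) {
        return json.get(fieldName);   // reads straight from the underlying JsonObject storage
      }
    }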
> >
> >                 Reuven
> >
> >                 On Tue, May 22, 2018 at 11:37 AM Romain Manni-Bucau
> >                 <rmannibucau@gmail.com <ma...@gmail.com>>
> >                 wrote:
> >
> >                     Well, Beam can implement a new mapper but it doesn't
> >                     help for IO. Most modern backends will take JSON
> >                     directly, even the javax one, and it must stay generic.
> >
> >                     Then, since JSON to POJO mapping has already been done a
> >                     dozen times, not sure it is worth it for now.
> >
> >                     On Tue, May 22, 2018 at 20:27, Reuven Lax
> >                     <relax@google.com <ma...@google.com>> wrote:
> >
> >                         We can do even better btw. Building a
> >                         SchemaRegistry where automatic conversions can
> >                         be registered between schema and Java data
> >                         types. With this the user won't even need a DoFn
> >                         to do the conversion.
> >

Re: Beam SQL Improvements

Posted by Romain Manni-Bucau <rm...@gmail.com>.
This can create other issues with IO if the runner is not designed for it
(like the direct runner), so probably not something reliable for the Beam
generic part :(.

On Mon, Jun 4, 2018 at 20:10, Lukasz Cwik <lc...@google.com> wrote:

> Shouldn't the runner isolate each instance of the pipeline behind an
> appropriate class loader?
>
> On Sun, Jun 3, 2018 at 12:45 PM Reuven Lax <re...@google.com> wrote:
>
>> Just an update: Romain and I chatted on Slack, and I think I understand
>> his concern. The concern wasn't specifically about schemas, rather about
>> having a generic way to register per-ParDo state that has worker lifetime.
>> As evidence that such is needed, in many cases static variables are used to
>> simulate that. Static variables, however, have downsides - if two pipelines
>> are run on the same JVM (which happens often with unit tests, and there's nothing
>> that prevents a runner from doing so in a production environment), these
>> static variables will interfere with each other.
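
For illustration, the static-variable workaround being described; ExpensiveClient
is a hypothetical stand-in for some heavyweight, reusable resource:

    import org.apache.beam.sdk.transforms.DoFn;

    class MyFn extends DoFn<String, String> {
      static class ExpensiveClient {}

      // Worker-lifetime state simulated with a static: it is shared by every pipeline in
      // this JVM, so two pipelines running in the same process (e.g. unit tests) interfere.
      private static ExpensiveClient client;

      @Setup
      public void setup() {
        synchronized (MyFn.class) {
          if (client == null) {
            client = new ExpensiveClient();
          }
        }
      }

      @ProcessElement
      public void processElement(ProcessContext c) {
        c.output(c.element());        // a real DoFn would use the shared client here
      }
    }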
>>
>> On Thu, May 24, 2018 at 12:30 AM Reuven Lax <re...@google.com> wrote:
>>
>>> Romain, maybe it would be useful for us to find some time on slack. I'd
>>> like to understand your concerns. Also keep in mind that I'm tagging all
>>> these classes as Experimental for now, so we can definitely change these
>>> interfaces around if we decide they are not the best ones.
>>>
>>> Reuven
>>>
>>> On Tue, May 22, 2018 at 11:35 PM Romain Manni-Bucau <
>>> rmannibucau@gmail.com> wrote:
>>>
>>>> Why not extend ProcessContext to add the new remapped output? But it
>>>> looks good (the part I don't like is that creating a new context each time a
>>>> new feature is added hurts users. What about when Beam adds some
>>>> reactive support? A ReactiveOutputReceiver?)
>>>>
>>>> The Pipeline sounds like the wrong storage, since once distributed you have serialized
>>>> the instances, so you have kind of broken the lifecycle of the original instance and
>>>> have no real release/close hook on them anymore, right? Not sure we can do
>>>> better than DoFn/source embedded instances today.
>>>>
>>>>
>>>>
>>>>
>>>> On Wed, May 23, 2018 at 08:02, Romain Manni-Bucau <rm...@gmail.com>
>>>> wrote:
>>>>
>>>>>
>>>>>
>>>>> On Wed, May 23, 2018 at 07:55, Jean-Baptiste Onofré <jb...@nanthrax.net>
>>>>> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> IMHO, it would be better to have an explicit transform/IO as a converter.
>>>>>>
>>>>>> It would be easier for users.
>>>>>>
>>>>>> Another option would be to use a "TypeConverter/SchemaConverter" map as
>>>>>> we do in Camel: Beam could check the source/destination "type" and check
>>>>>> in the map if there's a converter available. This map can be stored as
>>>>>> part of the pipeline (as we do for filesystem registration).
>>>>>>
>>>>>
>>>>>
>>>>> It works in Camel because it is not strongly typed, doesn't it? So it can
>>>>> require a new Beam pipeline API.
>>>>>
>>>>> +1 for the explicit transform; if added to the pipeline API like a coder
>>>>> it wouldn't break the fluent API:
>>>>>
>>>>> p.apply(io).setOutputType(Foo.class)
>>>>>
>>>>> Coders can be a workaround since they own the type, but since the
>>>>> PCollection is the real owner it is surely saner this way, no?
>>>>>
>>>>> Also it probably needs to ensure all converters are present before running the
>>>>> pipeline; no implicit environment converter support is probably
>>>>> good to start with, to avoid late surprises.
>>>>>
>>>>>
>>>>>
>>>>>> My $0.01
>>>>>>
>>>>>> Regards
>>>>>> JB
>>>>>>
>>>>>> On 23/05/2018 07:51, Romain Manni-Bucau wrote:
>>>>>> > How does it work on the pipeline side?
>>>>>> > Do you generate these "virtual" IO at build time to enable the
>>>>>> fluent
>>>>>> > API to work not erasing generics?
>>>>>> >
>>>>>> > ex: SQL(row)->BigQuery(native) will not compile so we need a
>>>>>> > SQL(row)->BigQuery(row)
>>>>>> >
>>>>>> > Side note unrelated to Row: if you add another registry maybe a
>>>>>> pretask
>>>>>> > is to ensure beam has a kind of singleton/context to avoid to
>>>>>> duplicate
>>>>>> > it or not track it properly. These kind of converters will need a
>>>>>> global
>>>>>> > close and not only per record in general:
>>>>>> > converter.init();converter.convert(row);....converter.destroy();,
>>>>>> > otherwise it easily leaks. This is why it can require some way to
>>>>>> not
>>>>>> > recreate it. A quick fix, if you are in bytebuddy already, can be
>>>>>> to add
>>>>>> > it to setup/teardown pby, being more global would be nicer but is
>>>>>> more
>>>>>> > challenging.
>>>>>> >
>>>>>> > Romain Manni-Bucau
>>>>>> > @rmannibucau <https://twitter.com/rmannibucau> |  Blog
>>>>>> > <https://rmannibucau.metawerx.net/> | Old Blog
>>>>>> > <http://rmannibucau.wordpress.com> | Github
>>>>>> > <https://github.com/rmannibucau> | LinkedIn
>>>>>> > <https://www.linkedin.com/in/rmannibucau> | Book
>>>>>> > <
>>>>>> https://www.packtpub.com/application-development/java-ee-8-high-performance
>>>>>> >
>>>>>> >
>>>>>> >
>>>>>> > Le mer. 23 mai 2018 à 07:22, Reuven Lax <relax@google.com
>>>>>> > <ma...@google.com>> a écrit :
>>>>>> >
>>>>>> >     No - the only modules we need to add to core are the ones we
>>>>>> choose
>>>>>> >     to add. For example, I will probably add a registration for
>>>>>> >     TableRow/TableSchema (GCP BigQuery) so these can work seamlessly
>>>>>> >     with schemas. However I will add that to the GCP module, so only
>>>>>> >     someone depending on that module need to pull in that
>>>>>> dependency.
>>>>>> >     The Java ServiceLoader framework can be used by these modules to
>>>>>> >     register schemas for their types (we already do something
>>>>>> similar
>>>>>> >     for FileSystem and for coders as well).
>>>>>> >
>>>>>> >     BTW, right now the conversion back and forth between Row
>>>>>> objects I'm
>>>>>> >     doing in the ByteBuddy generated bytecode that we generate in
>>>>>> order
>>>>>> >     to invoke DoFns.
>>>>>> >
>>>>>> >     Reuven
>>>>>> >
>>>>>> >     On Tue, May 22, 2018 at 10:04 PM Romain Manni-Bucau
>>>>>> >     <rmannibucau@gmail.com <ma...@gmail.com>> wrote:
>>>>>> >
>>>>>> >         Hmm, the pluggability part is close to what I wanted to do
>>>>>> with
>>>>>> >         JsonObject as a main API (to avoid to redo a "row" API and
>>>>>> >         schema API)
>>>>>> >         Row.as(Class<T>) sounds good but then, does it mean we'll
>>>>>> get
>>>>>> >         beam-sdk-java-row-jsonobject like modules (I'm not against,
>>>>>> just
>>>>>> >         trying to understand here)?
>>>>>> >         If so, how an IO can use as() with the type it expects?
>>>>>> Doesnt
>>>>>> >         it lead to have a tons of  these modules at the end?
>>>>>> >
>>>>>> >         Romain Manni-Bucau
>>>>>> >         @rmannibucau <https://twitter.com/rmannibucau> |  Blog
>>>>>> >         <https://rmannibucau.metawerx.net/> | Old Blog
>>>>>> >         <http://rmannibucau.wordpress.com> | Github
>>>>>> >         <https://github.com/rmannibucau> | LinkedIn
>>>>>> >         <https://www.linkedin.com/in/rmannibucau> | Book
>>>>>> >         <
>>>>>> https://www.packtpub.com/application-development/java-ee-8-high-performance
>>>>>> >
>>>>>> >
>>>>>> >
>>>>>> >         Le mer. 23 mai 2018 à 04:57, Reuven Lax <relax@google.com
>>>>>> >         <ma...@google.com>> a écrit :
>>>>>> >
>>>>>> >             By the way Romain, if you have specific scenarios in
>>>>>> mind I
>>>>>> >             would love to hear them. I can try and guess what
>>>>>> exactly
>>>>>> >             you would like to get out of schemas, but it would work
>>>>>> >             better if you gave me concrete scenarios that you would
>>>>>> like
>>>>>> >             to work.
>>>>>> >
>>>>>> >             Reuven
>>>>>> >
>>>>>> >             On Tue, May 22, 2018 at 7:45 PM Reuven Lax <
>>>>>> relax@google.com
>>>>>> >             <ma...@google.com>> wrote:
>>>>>> >
>>>>>> >                 Yeah, what I'm working on will help with IO.
>>>>>> Basically
>>>>>> >                 if you register a function with SchemaRegistry that
>>>>>> >                 converts back and forth between a type (say
>>>>>> JsonObject)
>>>>>> >                 and a Beam Row, then it is applied by the framework
>>>>>> >                 behind the scenes as part of DoFn invocation.
>>>>>> Concrete
>>>>>> >                 example: let's say I have an IO that reads json
>>>>>> objects
>>>>>> >                   class MyJsonIORead extends PTransform<PBegin,
>>>>>> >                 JsonObject> {...}
>>>>>> >
>>>>>> >                 If you register a schema for this type (or you can
>>>>>> also
>>>>>> >                 just set the schema directly on the output
>>>>>> PCollection),
>>>>>> >                 then Beam knows how to convert back and forth
>>>>>> between
>>>>>> >                 JsonObject and Row. So the next ParDo can look like
>>>>>> >
>>>>>> >                 p.apply(new MyJsonIORead())
>>>>>> >                 .apply(ParDo.of(new DoFn<JsonObject, T>....
>>>>>> >                     @ProcessElement void process(@Element Row row) {
>>>>>> >                    })
>>>>>> >
>>>>>> >                 And Beam will automatically convert JsonObject to a
>>>>>> Row
>>>>>> >                 for processing (you aren't forced to do this of
>>>>>> course -
>>>>>> >                 you can always ask for it as a JsonObject).
>>>>>> >
>>>>>> >                 The same is true for output. If you have a sink that
>>>>>> >                 takes in JsonObject but the transform before it
>>>>>> produces
>>>>>> >                 Row objects (for instance - because the transform
>>>>>> before
>>>>>> >                 it is Beam SQL), Beam can automatically convert Row
>>>>>> back
>>>>>> >                 to JsonObject for you.
>>>>>> >
>>>>>> >                 All of this was detailed in the Schema doc I shared
>>>>>> a
>>>>>> >                 few months ago. There was a lot of discussion on
>>>>>> that
>>>>>> >                 document from various parties, and some of this API
>>>>>> is a
>>>>>> >                 result of that discussion. This is also working in
>>>>>> the
>>>>>> >                 branch JB and I were working on, though not yet
>>>>>> >                 integrated back to master.
>>>>>> >
>>>>>> >                 I would like to actually go further and make Row an
>>>>>> >                 interface and provide a way to automatically put a
>>>>>> Row
>>>>>> >                 interface on top of any other object (e.g.
>>>>>> JsonObject,
>>>>>> >                 Pojo, etc.) This won't change the way the user
>>>>>> writes
>>>>>> >                 code, but instead of Beam having to copy and
>>>>>> convert at
>>>>>> >                 each stage (e.g. from JsonObject to Row) it simply
>>>>>> will
>>>>>> >                 create a Row object that uses the the JsonObject as
>>>>>> its
>>>>>> >                 underlying storage.
>>>>>> >
>>>>>> >                 Reuven
>>>>>> >
>>>>>> >                 On Tue, May 22, 2018 at 11:37 AM Romain Manni-Bucau
>>>>>> >                 <rmannibucau@gmail.com <mailto:
>>>>>> rmannibucau@gmail.com>>
>>>>>> >                 wrote:
>>>>>> >
>>>>>> >                     Well, beam can implement a new mapper but it
>>>>>> doesnt
>>>>>> >                     help for io. Most of modern backends will take
>>>>>> json
>>>>>> >                     directly, even javax one and it must stay
>>>>>> generic.
>>>>>> >
>>>>>> >                     Then since json to pojo mapping is already done
>>>>>> a
>>>>>> >                     dozen of times, not sure it is worth it for now.
>>>>>> >
>>>>>> >                     Le mar. 22 mai 2018 20:27, Reuven Lax
>>>>>> >                     <relax@google.com <ma...@google.com>> a
>>>>>> écrit :
>>>>>> >
>>>>>> >                         We can do even better btw. Building a
>>>>>> >                         SchemaRegistry where automatic conversions
>>>>>> can
>>>>>> >                         be registered between schema and Java data
>>>>>> >                         types. With this the user won't even need a
>>>>>> DoFn
>>>>>> >                         to do the conversion.
>>>>>> >
>>>>>> >
>>>>>> >
>>>>>> > On Thu, Apr 26, 2018 at 12:28 PM Romain Manni-Bucau
>>>>>> > <rmannibucau@gmail.com> wrote:
>>>>>> >
>>>>>> > Just to let it be clear and let me understand: how is BeamRecord
>>>>>> > different from a JsonObject which is an API without implementation
>>>>>> > (not even a json one OOTB)? Advantages of the json *api* are indeed
>>>>>> > natural mapping (jsonb is based on jsonp so no new binding to
>>>>>> > reinvent) and simple serialization (json+gzip for ex, or avro if you
>>>>>> > want to be geeky).
>>>>>> >
>>>>>> > I fail to see the point to rebuild an ecosystem ATM.
>>>>>> >
>>>>>> > Le 26 avr. 2018 19:12, "Reuven Lax" <relax@google.com> a écrit :
>>>>>> >
>>>>>> >     Exactly what JB said. We will write a generic conversion from
>>>>>> >     Avro (or json) to Beam schemas, which will make them work
>>>>>> >     transparently with SQL. The plan is also to migrate Anton's work
>>>>>> >     so that POJOs work generically for any schema.
>>>>>> >
>>>>>> >     Reuven
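>>>>>> >
>>>>>> >     (As a rough sketch of what such a generic Avro conversion could
>>>>>> >     look like - the AvroUtils helper names below are assumptions for
>>>>>> >     illustration, not necessarily the final API:)
>>>>>> >
>>>>>> >       // genericRecord is an org.apache.avro.generic.GenericRecord
>>>>>> >       org.apache.avro.Schema avroSchema = genericRecord.getSchema();
>>>>>> >       Schema beamSchema = AvroUtils.toBeamSchema(avroSchema);
>>>>>> >       Row row = AvroUtils.toBeamRowStrict(genericRecord, beamSchema);
>>>>>> >       // from here a PCollection<Row> can be fed straight into Beam SQL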
>>>>>> >
>>>>>> >     On Thu, Apr 26, 2018 at 1:17 AM Jean-Baptiste Onofré
>>>>>> >     <jb@nanthrax.net> wrote:
>>>>>> >
>>>>>> >         For now we have a generic schema interface. Json-b can be an
>>>>>> >         impl, avro could be another one.
>>>>>> >
>>>>>> >         Regards
>>>>>> >         JB
>>>>>> >
>>>>>> >         Le 26 avr. 2018, à 12:08, Romain Manni-Bucau
>>>>>> >         <rmannibucau@gmail.com> a écrit:
>>>>>> >
>>>>>> >             Hmm,
>>>>>> >
>>>>>> >             avro still has the pitfall of an uncontrolled stack which
>>>>>> >             brings way too many dependencies to be part of any API;
>>>>>> >             this is why I proposed a JSON-P based API (JsonObject)
>>>>>> >             with a custom beam entry for some metadata (headers "à la
>>>>>> >             Camel").
>>>>>> >
>>>>>> >             Romain Manni-Bucau
>>>>>> >             @rmannibucau <https://twitter.com/rmannibucau> | Blog
>>>>>> >             <https://rmannibucau.metawerx.net/> | Old Blog
>>>>>> >             <http://rmannibucau.wordpress.com> | Github
>>>>>> >             <https://github.com/rmannibucau> | LinkedIn
>>>>>> >             <https://www.linkedin.com/in/rmannibucau> | Book
>>>>>> >             <https://www.packtpub.com/application-development/java-ee-8-high-performance>
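>>>>>> >
>>>>>> >             (For context, a minimal JSON-P/JSON-B sketch; the Person
>>>>>> >             type is only an example:)
>>>>>> >
>>>>>> >               JsonObject record = Json.createObjectBuilder()  // javax.json (JSON-P)
>>>>>> >                   .add("name", "romain")
>>>>>> >                   .add("visits", 42)
>>>>>> >                   .build();
>>>>>> >               Jsonb jsonb = JsonbBuilder.create();            // javax.json.bind (JSON-B)
>>>>>> >               Person p = jsonb.fromJson(record.toString(), Person.class);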
>>>>>> >
>>>>>> >
>>>>>> >
>>>>>> >             2018-04-26 9:59 GMT+02:00 Jean-Baptiste Onofré
>>>>>> >             <jb@nanthrax.net>:
>>>>>> >
>>>>>> >                 Hi Ismael
>>>>>> >
>>>>>> >                 You mean directly in Beam SQL?
>>>>>> >
>>>>>> >                 That will be part of schema support: generic record
>>>>>> >                 could be one of the payloads with across schema.
>>>>>> >
>>>>>> >                 Regards
>>>>>> >                 JB
>>>>>> >
>>>>>> >                 Le 26 avr. 2018, à 11:39, "Ismaël Mejía"
>>>>>> >                 <iemejia@gmail.com> a écrit:
>>>>>> >
>>>>>> >
>>>>>> >                     Hello Anton,
>>>>>> >
>>>>>> >                     Thanks for the descriptive email and the really
>>>>>> >                     useful work. Any plans to tackle PCollections of
>>>>>> >                     GenericRecord/IndexedRecords? it seems Avro is a
>>>>>> >                     natural fit for this approach too.
>>>>>> >
>>>>>> >                     Regards,
>>>>>> >                     Ismaël
>>>>>> >
>>>>>> >                     On Wed, Apr 25, 2018 at 9:04 PM, Anton Kedin
>>>>>> >                     <kedin@google.com> wrote:
>>>>>> >
>>>>>> >                         Hi,
>>>>>> >
>>>>>> >                         I want to highlight a couple of improvements
>>>>>> >                         to Beam SQL we have been working on recently
>>>>>> >                         which are targeted to make the Beam SQL API
>>>>>> >                         easier to use. Specifically these features
>>>>>> >                         simplify conversion of Java Beans and JSON
>>>>>> >                         strings to Rows.
>>>>>> >
>>>>>> >                         Feel free to try this and send any
>>>>>> >                         bugs/comments/PRs my way.
>>>>>> >
>>>>>> >                         **Caveat: this is still work in progress, and
>>>>>> >                         has known bugs and incomplete features, see
>>>>>> >                         below for details.**
>>>>>> >
>>>>>> >                         Background
>>>>>> >
>>>>>> >                         Beam SQL queries can only be applied to
>>>>>> >                         PCollection<Row>. This means that users need
>>>>>> >                         to convert whatever PCollection elements they
>>>>>> >                         have to Rows before querying them with SQL.
>>>>>> >                         This usually requires

Re: Beam SQL Improvements

Posted by Reuven Lax <re...@google.com>.
Does DirectRunner do this today?

On Mon, Jun 4, 2018 at 9:10 PM Lukasz Cwik <lc...@google.com> wrote:

> Shouldn't the runner isolate each instance of the pipeline behind an
> appropriate class loader?
>
> On Sun, Jun 3, 2018 at 12:45 PM Reuven Lax <re...@google.com> wrote:
>
>> Just an update: Romain and I chatted on Slack, and I think I understand
>> his concern. The concern wasn't specifically about schemas, rather about
>> having a generic way to register per-ParDo state that has worker lifetime.
>> As evidence that such is needed, in many cases static variables are used to
>> simulate that. Static variables, however, have downsides - if two pipelines
>> are run on the same JVM (happens often with unit tests, and there's nothing
>> that prevents a runner from doing so in a production environment), these
>> static variables will interfere with each other.
>>
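>> (Purely as an illustration of that pattern - the DoFn below is made up,
>> not something from the thread:)
>>
>>   class CountingFn extends DoFn<String, String> {
>>     // "Worker lifetime" state simulated with a static field: it survives
>>     // across bundles, but it is also shared by every pipeline that runs
>>     // in this JVM (e.g. two unit tests), so their counts interfere.
>>     private static final java.util.concurrent.atomic.AtomicLong SEEN =
>>         new java.util.concurrent.atomic.AtomicLong();
>>
>>     @ProcessElement
>>     public void process(ProcessContext c) {
>>       c.output(c.element() + " #" + SEEN.incrementAndGet());
>>     }
>>   }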
>> On Thu, May 24, 2018 at 12:30 AM Reuven Lax <re...@google.com> wrote:
>>
>>> Romain, maybe it would be useful for us to find some time on slack. I'd
>>> like to understand your concerns. Also keep in mind that I'm tagging all
>>> these classes as Experimental for now, so we can definitely change these
>>> interfaces around if we decide they are not the best ones.
>>>
>>> Reuven
>>>
>>> On Tue, May 22, 2018 at 11:35 PM Romain Manni-Bucau <
>>> rmannibucau@gmail.com> wrote:
>>>
>>>> Why not extending ProcessContext to add the new remapped output? But
>>>> looks good (the part I don't like is that creating a new context each time a
>>>> new feature is added is hurting users. What when beam will add some
>>>> reactive support? ReactiveOutputReceiver?)
>>>>
>>>> Pipeline sounds the wrong storage since once distributed you serialized
>>>> the instances so kind of broke the lifecycle of the original instance and
>>>> have no real release/close hook on them anymore right? Not sure we can do
>>>> better than dofn/source embedded instances today.
>>>>
>>>>
>>>>
>>>>
>>>> Le mer. 23 mai 2018 08:02, Romain Manni-Bucau <rm...@gmail.com>
>>>> a écrit :
>>>>
>>>>>
>>>>>
>>>>> Le mer. 23 mai 2018 07:55, Jean-Baptiste Onofré <jb...@nanthrax.net> a
>>>>> écrit :
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> IMHO, it would be better to have an explicit transform/IO as converter.
>>>>>>
>>>>>> It would be easier for users.
>>>>>>
>>>>>> Another option would be to use a "TypeConverter/SchemaConverter" map
>>>>>> as
>>>>>> we do in Camel: Beam could check the source/destination "type" and
>>>>>> check
>>>>>> in the map if there's a converter available. This map can be stored as
>>>>>> part of the pipeline (as we do for filesystem registration).
>>>>>>
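>>>>>> (Roughly what such a map could look like - SchemaConverterRegistry and
>>>>>> its methods are invented names for illustration, not an existing Beam
>>>>>> API; java.util and Beam's SerializableFunction are assumed imported:)
>>>>>>
>>>>>>   class SchemaConverterRegistry {
>>>>>>     // keyed by (source type, destination type)
>>>>>>     private final Map<List<Class<?>>, SerializableFunction<?, ?>> converters =
>>>>>>         new HashMap<>();
>>>>>>
>>>>>>     <A, B> void register(Class<A> from, Class<B> to, SerializableFunction<A, B> fn) {
>>>>>>       converters.put(Arrays.asList(from, to), fn);
>>>>>>     }
>>>>>>
>>>>>>     @SuppressWarnings("unchecked")
>>>>>>     <A, B> SerializableFunction<A, B> lookup(Class<A> from, Class<B> to) {
>>>>>>       return (SerializableFunction<A, B>) converters.get(Arrays.asList(from, to));
>>>>>>     }
>>>>>>   }
>>>>>>
>>>>>>   // e.g. registry.register(JsonObject.class, Row.class, json -> toRow(json));
>>>>>>   // where toRow is whatever user/IO conversion applies (hypothetical helper).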
>>>>>
>>>>>
>>>>> It works in camel because it is not strongly typed, isn't it? So it can
>>>>> require a new beam pipeline api.
>>>>>
>>>>> +1 for the explicit transform, if added to the pipeline api as a coder
>>>>> it wouldn't break the fluent api:
>>>>>
>>>>> p.apply(io).setOutputType(Foo.class)
>>>>>
>>>>> Coders can be a workaround since they own the type but since the
>>>>> pcollection is the real owner it is surely saner this way, no?
>>>>>
>>>>> Also it needs to ensure all converters are present before running the
>>>>> pipeline probably, no implicit environment converter support is probably
>>>>> good to start to avoid late surprises.
>>>>>
>>>>>
>>>>>
>>>>>> My $0.01
>>>>>>
>>>>>> Regards
>>>>>> JB
>>>>>>
>>>>>> On 23/05/2018 07:51, Romain Manni-Bucau wrote:
>>>>>> > How does it work on the pipeline side?
>>>>>> > Do you generate these "virtual" IO at build time to enable the
>>>>>> fluent
>>>>>> > API to work not erasing generics?
>>>>>> >
>>>>>> > ex: SQL(row)->BigQuery(native) will not compile so we need a
>>>>>> > SQL(row)->BigQuery(row)
>>>>>> >
>>>>>> > Side note unrelated to Row: if you add another registry maybe a
>>>>>> pretask
>>>>>> > is to ensure beam has a kind of singleton/context to avoid
>>>>>> duplicating
>>>>>> > it or not tracking it properly. These kinds of converters will need a
>>>>>> global
>>>>>> > close and not only per record in general:
>>>>>> > converter.init();converter.convert(row);....converter.destroy();,
>>>>>> > otherwise it easily leaks. This is why it can require some way to
>>>>>> not
>>>>>> > recreate it. A quick fix, if you are in bytebuddy already, can be
>>>>>> to add
>>>>>> > it to setup/teardown pby, being more global would be nicer but is
>>>>>> more
>>>>>> > challenging.
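>>>>>> >
>>>>>> > (A rough sketch of the setup/teardown option - Converter and
>>>>>> > JsonToRowConverter are made-up names, only the DoFn annotations are
>>>>>> > real:)
>>>>>> >
>>>>>> >   class ConvertingFn extends DoFn<JsonObject, Row> {
>>>>>> >     private transient Converter converter;
>>>>>> >
>>>>>> >     @Setup
>>>>>> >     public void setup() {
>>>>>> >       converter = new JsonToRowConverter();
>>>>>> >       converter.init();                         // global init, once per instance
>>>>>> >     }
>>>>>> >
>>>>>> >     @ProcessElement
>>>>>> >     public void process(ProcessContext c) {
>>>>>> >       c.output(converter.convert(c.element())); // per record
>>>>>> >     }
>>>>>> >
>>>>>> >     @Teardown
>>>>>> >     public void teardown() {
>>>>>> >       converter.destroy();                      // the missing global close
>>>>>> >     }
>>>>>> >   }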
>>>>>> >
>>>>>> > Romain Manni-Bucau
>>>>>> > @rmannibucau <https://twitter.com/rmannibucau> |  Blog
>>>>>> > <https://rmannibucau.metawerx.net/> | Old Blog
>>>>>> > <http://rmannibucau.wordpress.com> | Github
>>>>>> > <https://github.com/rmannibucau> | LinkedIn
>>>>>> > <https://www.linkedin.com/in/rmannibucau> | Book
>>>>>> > <
>>>>>> https://www.packtpub.com/application-development/java-ee-8-high-performance
>>>>>> >
>>>>>> >
>>>>>> >
>>>>>> > Le mer. 23 mai 2018 à 07:22, Reuven Lax <relax@google.com
>>>>>> > <ma...@google.com>> a écrit :
>>>>>> >
>>>>>> >     No - the only modules we need to add to core are the ones we
>>>>>> choose
>>>>>> >     to add. For example, I will probably add a registration for
>>>>>> >     TableRow/TableSchema (GCP BigQuery) so these can work seamlessly
>>>>>> >     with schemas. However I will add that to the GCP module, so only
>>>>>> >     someone depending on that module needs to pull in that
>>>>>> dependency.
>>>>>> >     The Java ServiceLoader framework can be used by these modules to
>>>>>> >     register schemas for their types (we already do something
>>>>>> similar
>>>>>> >     for FileSystem and for coders as well).
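>>>>>> >
>>>>>> >     (For illustration, the ServiceLoader pattern referred to here -
>>>>>> >     SchemaRegistrar and the registration calls are invented names,
>>>>>> >     not the actual Beam interface:)
>>>>>> >
>>>>>> >       // In the GCP module (all names hypothetical):
>>>>>> >       public class BigQuerySchemaRegistrar implements SchemaRegistrar {
>>>>>> >         @Override
>>>>>> >         public void registerSchemas(SchemaRegistry registry) {
>>>>>> >           registry.registerSchemaProvider(TableRow.class, new TableRowSchemaProvider());
>>>>>> >         }
>>>>>> >       }
>>>>>> >       // listed under META-INF/services/<package>.SchemaRegistrar so that
>>>>>> >       // ServiceLoader.load(SchemaRegistrar.class) can discover it from core.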
>>>>>> >
>>>>>> >     BTW, right now the conversion back and forth between Row
>>>>>> objects I'm
>>>>>> >     doing in the ByteBuddy generated bytecode that we generate in
>>>>>> order
>>>>>> >     to invoke DoFns.
>>>>>> >
>>>>>> >     Reuven
>>>>>> >
>>>>>> >     On Tue, May 22, 2018 at 10:04 PM Romain Manni-Bucau
>>>>>> >     <rmannibucau@gmail.com <ma...@gmail.com>> wrote:
>>>>>> >
>>>>>> >         Hmm, the pluggability part is close to what I wanted to do
>>>>>> with
>>>>>> >         JsonObject as a main API (to avoid to redo a "row" API and
>>>>>> >         schema API)
>>>>>> >         Row.as(Class<T>) sounds good but then, does it mean we'll
>>>>>> get
>>>>>> >         beam-sdk-java-row-jsonobject like modules (I'm not against,
>>>>>> just
>>>>>> >         trying to understand here)?
>>>>>> >         If so, how can an IO use as() with the type it expects? Doesn't
>>>>>> >         it lead to having tons of these modules at the end?
>>>>>> >
>>>>>> >         Romain Manni-Bucau
>>>>>> >         @rmannibucau <https://twitter.com/rmannibucau> |  Blog
>>>>>> >         <https://rmannibucau.metawerx.net/> | Old Blog
>>>>>> >         <http://rmannibucau.wordpress.com> | Github
>>>>>> >         <https://github.com/rmannibucau> | LinkedIn
>>>>>> >         <https://www.linkedin.com/in/rmannibucau> | Book
>>>>>> >         <
>>>>>> https://www.packtpub.com/application-development/java-ee-8-high-performance
>>>>>> >
>>>>>> >
>>>>>> >
>>>>>> >         Le mer. 23 mai 2018 à 04:57, Reuven Lax <relax@google.com
>>>>>> >         <ma...@google.com>> a écrit :
>>>>>> >
>>>>>> >             By the way Romain, if you have specific scenarios in
>>>>>> mind I
>>>>>> >             would love to hear them. I can try and guess what
>>>>>> exactly
>>>>>> >             you would like to get out of schemas, but it would work
>>>>>> >             better if you gave me concrete scenarios that you would
>>>>>> like
>>>>>> >             to work.
>>>>>> >
>>>>>> >             Reuven
>>>>>> >
>>>>>> >             On Tue, May 22, 2018 at 7:45 PM Reuven Lax <
>>>>>> relax@google.com
>>>>>> >             <ma...@google.com>> wrote:
>>>>>> >
>>>>>> >                 Yeah, what I'm working on will help with IO.
>>>>>> Basically
>>>>>> >                 if you register a function with SchemaRegistry that
>>>>>> >                 converts back and forth between a type (say
>>>>>> JsonObject)
>>>>>> >                 and a Beam Row, then it is applied by the framework
>>>>>> >                 behind the scenes as part of DoFn invocation.
>>>>>> Concrete
>>>>>> >                 example: let's say I have an IO that reads json
>>>>>> objects
>>>>>> >                   class MyJsonIORead extends PTransform<PBegin,
>>>>>> >                 JsonObject> {...}
>>>>>> >
>>>>>> >                 If you register a schema for this type (or you can
>>>>>> also
>>>>>> >                 just set the schema directly on the output
>>>>>> PCollection),
>>>>>> >                 then Beam knows how to convert back and forth
>>>>>> between
>>>>>> >                 JsonObject and Row. So the next ParDo can look like
>>>>>> >
>>>>>> >                 p.apply(new MyJsonIORead())
>>>>>> >                 .apply(ParDo.of(new DoFn<JsonObject, T>....
>>>>>> >                     @ProcessElement void process(@Element Row row) {
>>>>>> >                    })
>>>>>> >
>>>>>> >                 And Beam will automatically convert JsonObject to a
>>>>>> Row
>>>>>> >                 for processing (you aren't forced to do this of
>>>>>> course -
>>>>>> >                 you can always ask for it as a JsonObject).
>>>>>> >
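>>>>>> >                 (Filled out slightly as a sketch - MyJsonIORead and the
>>>>>> >                 exact registration step are assumptions taken from the
>>>>>> >                 example above, not a fixed API:)
>>>>>> >
>>>>>> >                   PCollection<JsonObject> json = p.apply(new MyJsonIORead());
>>>>>> >                   // either register JsonObject once in the SchemaRegistry,
>>>>>> >                   // or set the schema directly on the output PCollection
>>>>>> >                   json.apply(ParDo.of(new DoFn<JsonObject, String>() {
>>>>>> >                     @ProcessElement
>>>>>> >                     public void process(@Element Row row, OutputReceiver<String> out) {
>>>>>> >                       out.output(row.getString("name")); // converted behind the scenes
>>>>>> >                     }
>>>>>> >                   }));
>>>>>> >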
>>>>>> >                 The same is true for output. If you have a sink that
>>>>>> >                 takes in JsonObject but the transform before it
>>>>>> produces
>>>>>> >                 Row objects (for instance - because the transform
>>>>>> before
>>>>>> >                 it is Beam SQL), Beam can automatically convert Row
>>>>>> back
>>>>>> >                 to JsonObject for you.
>>>>>> >
>>>>>> >                 All of this was detailed in the Schema doc I shared
>>>>>> a
>>>>>> >                 few months ago. There was a lot of discussion on
>>>>>> that
>>>>>> >                 document from various parties, and some of this API
>>>>>> is a
>>>>>> >                 result of that discussion. This is also working in
>>>>>> the
>>>>>> >                 branch JB and I were working on, though not yet
>>>>>> >                 integrated back to master.
>>>>>> >
>>>>>> >                 I would like to actually go further and make Row an
>>>>>> >                 interface and provide a way to automatically put a
>>>>>> Row
>>>>>> >                 interface on top of any other object (e.g.
>>>>>> JsonObject,
>>>>>> >                 Pojo, etc.) This won't change the way the user
>>>>>> writes
>>>>>> >                 code, but instead of Beam having to copy and
>>>>>> convert at
>>>>>> >                 each stage (e.g. from JsonObject to Row) it simply
>>>>>> will
>>>>>> >                 create a Row object that uses the JsonObject as
>>>>>> its
>>>>>> >                 underlying storage.
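>>>>>> >
>>>>>> >                 (Very roughly, the idea - RowView and JsonRowView are
>>>>>> >                 made-up names, just to show the "no copy" direction:)
>>>>>> >
>>>>>> >                   class JsonRowView implements RowView {
>>>>>> >                     private final JsonObject json;   // underlying storage
>>>>>> >                     JsonRowView(JsonObject json) { this.json = json; }
>>>>>> >                     public Object getValue(String field) {
>>>>>> >                       return json.get(field);        // read through, no conversion
>>>>>> >                     }
>>>>>> >                   }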
>>>>>> >
>>>>>> >                 Reuven
>>>>>> >
>>>>>> >                 On Tue, May 22, 2018 at 11:37 AM Romain Manni-Bucau
>>>>>> >                 <rmannibucau@gmail.com <mailto:
>>>>>> rmannibucau@gmail.com>>
>>>>>> >                 wrote:
>>>>>> >
>>>>>> >                     Well, beam can implement a new mapper but it
>>>>>> doesn't
>>>>>> >                     help for io. Most of modern backends will take
>>>>>> json
>>>>>> >                     directly, even javax one and it must stay
>>>>>> generic.
>>>>>> >
>>>>>> >                     Then since json to pojo mapping is already done
>>>>>> a
>>>>>> >                     dozen of times, not sure it is worth it for now.
>>>>>> >
>>>>>> >                     Le mar. 22 mai 2018 20:27, Reuven Lax
>>>>>> >                     <relax@google.com <ma...@google.com>> a
>>>>>> écrit :
>>>>>> >
>>>>>> >                         We can do even better btw. Building a
>>>>>> >                         SchemaRegistry where automatic conversions
>>>>>> can
>>>>>> >                         be registered between schema and Java data
>>>>>> >                         types. With this the user won't even need a
>>>>>> DoFn
>>>>>> >                         to do the conversion.
>>>>>> >

Re: Beam SQL Improvements

Posted by Lukasz Cwik <lc...@google.com>.
Shouldn't the runner isolate each instance of the pipeline behind an
appropriate class loader?

On Sun, Jun 3, 2018 at 12:45 PM Reuven Lax <re...@google.com> wrote:

> Just an update: Romain and I chatted on Slack, and I think I understand
> his concern. The concern wasn't specifically about schemas, rather about
> having a generic way to register per-ParDo state that has worker lifetime.
> As evidence that such is needed, in many cases static variables are used to
> simulate that. Static variables, however, have downsides - if two pipelines
> are run on the same JVM (happens often with unit tests, and there's nothing
> that prevents a runner from doing so in a production environment), these
> static variables will interfere with each other.
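As a concrete illustration of the static-variable workaround described above, here is a minimal sketch; all class names are made up, and only DoFn with its @Setup/@ProcessElement hooks is real Beam API. The JVM-wide map is exactly the kind of state two pipelines in one JVM would end up sharing:

  import java.util.concurrent.ConcurrentHashMap;
  import java.util.concurrent.ConcurrentMap;
  import org.apache.beam.sdk.transforms.DoFn;

  // Illustrative only: "worker lifetime" state simulated with a static map.
  public class ExpensiveClientFn extends DoFn<String, String> {

    // JVM-wide, not pipeline-scoped: two pipelines running in the same JVM
    // (e.g. two unit tests) share these entries, and there is no natural
    // point at which to close them.
    private static final ConcurrentMap<String, ExpensiveClient> CLIENTS =
        new ConcurrentHashMap<>();

    private final String connectionString;
    private transient ExpensiveClient client;

    public ExpensiveClientFn(String connectionString) {
      this.connectionString = connectionString;
    }

    @Setup
    public void setup() {
      client = CLIENTS.computeIfAbsent(connectionString, ExpensiveClient::new);
    }

    @ProcessElement
    public void processElement(ProcessContext c) {
      c.output(client.lookup(c.element()));
    }

    // Hypothetical expensive resource (connection pool, converter, ...).
    private static final class ExpensiveClient {
      ExpensiveClient(String connectionString) {}
      String lookup(String key) { return key; }
    }
  }

A worker-scoped registration hook, as discussed in this thread, would give such state an explicit lifetime instead of leaving it to a shared static.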
>
> On Thu, May 24, 2018 at 12:30 AM Reuven Lax <re...@google.com> wrote:
>
>> Romain, maybe it would be useful for us to find some time on slack. I'd
>> like to understand your concerns. Also keep in mind that I'm tagging all
>> these classes as Experimental for now, so we can definitely change these
>> interfaces around if we decide they are not the best ones.
>>
>> Reuven
>>
>> On Tue, May 22, 2018 at 11:35 PM Romain Manni-Bucau <
>> rmannibucau@gmail.com> wrote:
>>
>>> Why not extending ProcessContext to add the new remapped output? But
>>> looks good (the part i dont like is that creating a new context each time a
>>> new feature is added is hurting users. What when beam will add some
>>> reactive support? ReactiveOutputReceiver?)
>>>
>>> Pipeline sounds the wrong storage since once distributed you serialized
>>> the instances so kind of broke the lifecycle of the original instance and
>>> have no real release/close hook on them anymore right? Not sure we can do
>>> better than dofn/source embedded instances today.
>>>
>>>
>>>
>>>
>>> Le mer. 23 mai 2018 08:02, Romain Manni-Bucau <rm...@gmail.com> a
>>> écrit :
>>>
>>>>
>>>>
>>>> Le mer. 23 mai 2018 07:55, Jean-Baptiste Onofré <jb...@nanthrax.net> a
>>>> écrit :
>>>>
>>>>> Hi,
>>>>>
>>>>> IMHO, it would be better to have a explicit transform/IO as converter.
>>>>>
>>>>> It would be easier for users.
>>>>>
>>>>> Another option would be to use a "TypeConverter/SchemaConverter" map as
>>>>> we do in Camel: Beam could check the source/destination "type" and
>>>>> check
>>>>> in the map if there's a converter available. This map can be store as
>>>>> part of the pipeline (as we do for filesystem registration).
>>>>>
>>>>
>>>>
>>>> It works in camel because it is not strongly typed, isnt it? So can
>>>> require a beam new pipeline api.
>>>>
>>>> +1 for the explicit transform, if added to the pipeline api as coder it
>>>> wouldnt break the fluent api:
>>>>
>>>> p.apply(io).setOutputType(Foo.class)
>>>>
>>>> Coders can be a workaround since they owns the type but since the
>>>> pcollection is the real owner it is surely saner this way, no?
>>>>
>>>> Also it needs to ensure all converters are present before running the
>>>> pipeline probably, no implicit environment converter support is probably
>>>> good to start to avoid late surprises.
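For reference, an explicit conversion step needs no new pipeline API at all; a sketch with a made-up Foo type and parse function, using only MapElements/SimpleFunction from the SDK (coder registration for Foo is omitted):

  import org.apache.beam.sdk.transforms.MapElements;
  import org.apache.beam.sdk.transforms.SimpleFunction;
  import org.apache.beam.sdk.values.PCollection;

  // Sketch of the "explicit transform as converter" option: the conversion is
  // a visible pipeline step rather than something applied implicitly.
  public final class ExplicitConversion {

    public static PCollection<Foo> jsonToFoo(PCollection<String> json) {
      return json.apply(
          "ConvertJsonToFoo",
          MapElements.via(
              new SimpleFunction<String, Foo>() {
                @Override
                public Foo apply(String input) {
                  return Foo.parse(input); // user-supplied mapping, e.g. JSON-B
                }
              }));
    }

    // Placeholder user type.
    public static final class Foo {
      static Foo parse(String input) { return new Foo(); }
    }
  }

Whether such a step is written by hand like this or generated from a converter registry, as suggested above, is the open question in this thread.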
>>>>
>>>>
>>>>
>>>>> My $0.01
>>>>>
>>>>> Regards
>>>>> JB
>>>>>
>>>>> On 23/05/2018 07:51, Romain Manni-Bucau wrote:
>>>>> > How does it work on the pipeline side?
>>>>> > Do you generate these "virtual" IO at build time to enable the fluent
>>>>> > API to work not erasing generics?
>>>>> >
>>>>> > ex: SQL(row)->BigQuery(native) will not compile so we need a
>>>>> > SQL(row)->BigQuery(row)
>>>>> >
>>>>> > Side note unrelated to Row: if you add another registry maybe a
>>>>> pretask
>>>>> > is to ensure beam has a kind of singleton/context to avoid to
>>>>> duplicate
>>>>> > it or not track it properly. These kind of converters will need a
>>>>> global
>>>>> > close and not only per record in general:
>>>>> > converter.init();converter.convert(row);....converter.destroy();,
>>>>> > otherwise it easily leaks. This is why it can require some way to not
>>>>> > recreate it. A quick fix, if you are in bytebuddy already, can be to
>>>>> add
>>>>> > it to setup/teardown pby, being more global would be nicer but is
>>>>> more
>>>>> > challenging.
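The init/convert/destroy lifecycle described above maps naturally onto @Setup/@Teardown; a sketch follows, where RowConverter is a hypothetical stand-in rather than a Beam class:

  import org.apache.beam.sdk.transforms.DoFn;
  import org.apache.beam.sdk.values.Row;

  // Illustrative DoFn that owns its converter for the lifetime of the DoFn
  // instance instead of creating and leaking one per record.
  class ConvertingFn extends DoFn<Row, String> {

    private transient RowConverter converter;

    @Setup
    public void setup() {
      converter = new RowConverter();
      converter.init();
    }

    @ProcessElement
    public void processElement(ProcessContext c) {
      c.output(converter.convert(c.element()));
    }

    @Teardown
    public void teardown() {
      converter.destroy(); // the "global close", approximated at DoFn scope
    }

    // Hypothetical converter with an explicit lifecycle.
    static final class RowConverter {
      void init() {}
      String convert(Row row) { return String.valueOf(row); }
      void destroy() {}
    }
  }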
>>>>> >
>>>>> > Romain Manni-Bucau
>>>>> > @rmannibucau <https://twitter.com/rmannibucau> |  Blog
>>>>> > <https://rmannibucau.metawerx.net/> | Old Blog
>>>>> > <http://rmannibucau.wordpress.com> | Github
>>>>> > <https://github.com/rmannibucau> | LinkedIn
>>>>> > <https://www.linkedin.com/in/rmannibucau> | Book
>>>>> > <
>>>>> https://www.packtpub.com/application-development/java-ee-8-high-performance
>>>>> >
>>>>> >
>>>>> >
>>>>> > Le mer. 23 mai 2018 à 07:22, Reuven Lax <relax@google.com
>>>>> > <ma...@google.com>> a écrit :
>>>>> >
>>>>> >     No - the only modules we need to add to core are the ones we
>>>>> choose
>>>>> >     to add. For example, I will probably add a registration for
>>>>> >     TableRow/TableSchema (GCP BigQuery) so these can work seamlessly
>>>>> >     with schemas. However I will add that to the GCP module, so only
>>>>> >     someone depending on that module need to pull in that dependency.
>>>>> >     The Java ServiceLoader framework can be used by these modules to
>>>>> >     register schemas for their types (we already do something similar
>>>>> >     for FileSystem and for coders as well).
>>>>> >
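For readers who have not used it, the ServiceLoader mechanism mentioned here boils down to the following; the registrar and registry interfaces are invented for the example and are not the actual Beam types:

  import java.util.ServiceLoader;

  public final class RegistrarDiscovery {

    /** Hypothetical registry that modules contribute schema conversions to. */
    public interface ConversionRegistry {
      void register(Class<?> type, Object toRowFn, Object fromRowFn);
    }

    /** Hypothetical per-module registrar, discovered at runtime. */
    public interface SchemaRegistrar {
      void register(ConversionRegistry registry);
    }

    // Each module ships an implementation of SchemaRegistrar plus a
    // META-INF/services/RegistrarDiscovery$SchemaRegistrar file naming it;
    // the core SDK then only needs this loop to pick it up from the classpath.
    public static void registerAll(ConversionRegistry registry) {
      for (SchemaRegistrar registrar : ServiceLoader.load(SchemaRegistrar.class)) {
        registrar.register(registry);
      }
    }
  }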
>>>>> >     BTW, right now the conversion back and forth between Row objects
>>>>> I'm
>>>>> >     doing in the ByteBuddy generated bytecode that we generate in
>>>>> order
>>>>> >     to invoke DoFns.
>>>>> >
>>>>> >     Reuven
>>>>> >
>>>>> >     On Tue, May 22, 2018 at 10:04 PM Romain Manni-Bucau
>>>>> >     <rmannibucau@gmail.com <ma...@gmail.com>> wrote:
>>>>> >
>>>>> >         Hmm, the pluggability part is close to what I wanted to do
>>>>> with
>>>>> >         JsonObject as a main API (to avoid to redo a "row" API and
>>>>> >         schema API)
>>>>> >         Row.as(Class<T>) sounds good but then, does it mean we'll get
>>>>> >         beam-sdk-java-row-jsonobject like modules (I'm not against,
>>>>> just
>>>>> >         trying to understand here)?
>>>>> >         If so, how an IO can use as() with the type it expects?
>>>>> Doesnt
>>>>> >         it lead to have a tons of  these modules at the end?
>>>>> >
>>>>> >         Romain Manni-Bucau
>>>>> >         @rmannibucau <https://twitter.com/rmannibucau> |  Blog
>>>>> >         <https://rmannibucau.metawerx.net/> | Old Blog
>>>>> >         <http://rmannibucau.wordpress.com> | Github
>>>>> >         <https://github.com/rmannibucau> | LinkedIn
>>>>> >         <https://www.linkedin.com/in/rmannibucau> | Book
>>>>> >         <
>>>>> https://www.packtpub.com/application-development/java-ee-8-high-performance
>>>>> >
>>>>> >
>>>>> >
>>>>> >         Le mer. 23 mai 2018 à 04:57, Reuven Lax <relax@google.com
>>>>> >         <ma...@google.com>> a écrit :
>>>>> >
>>>>> >             By the way Romain, if you have specific scenarios in
>>>>> mind I
>>>>> >             would love to hear them. I can try and guess what exactly
>>>>> >             you would like to get out of schemas, but it would work
>>>>> >             better if you gave me concrete scenarios that you would
>>>>> like
>>>>> >             to work.
>>>>> >
>>>>> >             Reuven
>>>>> >
>>>>> >             On Tue, May 22, 2018 at 7:45 PM Reuven Lax <
>>>>> relax@google.com
>>>>> >             <ma...@google.com>> wrote:
>>>>> >
>>>>> >                 Yeah, what I'm working on will help with IO.
>>>>> Basically
>>>>> >                 if you register a function with SchemaRegistry that
>>>>> >                 converts back and forth between a type (say
>>>>> JsonObject)
>>>>> >                 and a Beam Row, then it is applied by the framework
>>>>> >                 behind the scenes as part of DoFn invocation.
>>>>> Concrete
>>>>> >                 example: let's say I have an IO that reads json
>>>>> objects
>>>>> >                   class MyJsonIORead extends PTransform<PBegin,
>>>>> >                 JsonObject> {...}
>>>>> >
>>>>> >                 If you register a schema for this type (or you can
>>>>> also
>>>>> >                 just set the schema directly on the output
>>>>> PCollection),
>>>>> >                 then Beam knows how to convert back and forth between
>>>>> >                 JsonObject and Row. So the next ParDo can look like
>>>>> >
>>>>> >                 p.apply(new MyJsonIORead())
>>>>> >                 .apply(ParDo.of(new DoFn<JsonObject, T>....
>>>>> >                     @ProcessElement void process(@Element Row row) {
>>>>> >                    })
>>>>> >
>>>>> >                 And Beam will automatically convert JsonObject to a
>>>>> Row
>>>>> >                 for processing (you aren't forced to do this of
>>>>> course -
>>>>> >                 you can always ask for it as a JsonObject).
>>>>> >
>>>>> >                 The same is true for output. If you have a sink that
>>>>> >                 takes in JsonObject but the transform before it
>>>>> produces
>>>>> >                 Row objects (for instance - because the transform
>>>>> before
>>>>> >                 it is Beam SQL), Beam can automatically convert Row
>>>>> back
>>>>> >                 to JsonObject for you.
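Filling out the fragment above into something closer to compilable form, purely as a sketch of the behaviour being described: the upstream PCollection<JsonObject> is assumed to come from the hypothetical MyJsonIORead with a schema already registered or set on it, and how that registration call looks was still in flux at this point. The field name is made up.

  import javax.json.JsonObject;
  import org.apache.beam.sdk.transforms.DoFn;
  import org.apache.beam.sdk.transforms.ParDo;
  import org.apache.beam.sdk.values.PCollection;
  import org.apache.beam.sdk.values.Row;

  class JsonAsRowSketch {
    static void addStep(PCollection<JsonObject> jsonObjects) {
      // Because a schema/conversion is registered for JsonObject, the DoFn can
      // ask for the element as a Row and let the framework convert it.
      jsonObjects.apply(
          "UseAsRow",
          ParDo.of(
              new DoFn<JsonObject, String>() {
                @ProcessElement
                public void process(@Element Row row, OutputReceiver<String> out) {
                  out.output(String.valueOf(row.getValue("someField"))); // "someField" is illustrative
                }
              }));
    }
  }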
>>>>> >
>>>>> >                 All of this was detailed in the Schema doc I shared a
>>>>> >                 few months ago. There was a lot of discussion on that
>>>>> >                 document from various parties, and some of this API
>>>>> is a
>>>>> >                 result of that discussion. This is also working in
>>>>> the
>>>>> >                 branch JB and I were working on, though not yet
>>>>> >                 integrated back to master.
>>>>> >
>>>>> >                 I would like to actually go further and make Row an
>>>>> >                 interface and provide a way to automatically put a
>>>>> Row
>>>>> >                 interface on top of any other object (e.g.
>>>>> JsonObject,
>>>>> >                 Pojo, etc.) This won't change the way the user writes
>>>>> >                 code, but instead of Beam having to copy and convert
>>>>> at
>>>>> >                 each stage (e.g. from JsonObject to Row) it simply
>>>>> will
>>>>> >                 create a Row object that uses the JsonObject as
>>>>> its
>>>>> >                 underlying storage.
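The "Row as an interface" idea above, sketched with invented names to make the intent concrete; nothing here is actual Beam API:

  import javax.json.JsonObject;

  final class RowViewSketch {

    /** Hypothetical read-only row facade. */
    interface RowView {
      Object getValue(String fieldName);
    }

    /** Exposes a JsonObject through the row facade without copying it. */
    static final class JsonObjectRowView implements RowView {
      private final JsonObject json;

      JsonObjectRowView(JsonObject json) {
        this.json = json;
      }

      @Override
      public Object getValue(String fieldName) {
        // Delegates to the underlying JsonObject; a real implementation would
        // also map JSON values onto the schema's field types.
        return json.get(fieldName);
      }
    }
  }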
>>>>> >
>>>>> >                 Reuven
>>>>> >
>>>>> >                 On Tue, May 22, 2018 at 11:37 AM Romain Manni-Bucau
>>>>> >                 <rmannibucau@gmail.com <mailto:rmannibucau@gmail.com
>>>>> >>
>>>>> >                 wrote:
>>>>> >
>>>>> >                     Well, beam can implement a new mapper but it
>>>>> doesnt
>>>>> >                     help for io. Most of modern backends will take
>>>>> json
>>>>> >                     directly, even javax one and it must stay
>>>>> generic.
>>>>> >
>>>>> >                     Then since json to pojo mapping is already done a
>>>>> >                     dozen of times, not sure it is worth it for now.
>>>>> >
>>>>> >                     Le mar. 22 mai 2018 20:27, Reuven Lax
>>>>> >                     <relax@google.com <ma...@google.com>> a
>>>>> écrit :
>>>>> >
>>>>> >                         We can do even better btw. Building a
>>>>> >                         SchemaRegistry where automatic conversions
>>>>> can
>>>>> >                         be registered between schema and Java data
>>>>> >                         types. With this the user won't even need a
>>>>> DoFn
>>>>> >                         to do the conversion.
>>>>> >
>>>>> >                         On Tue, May 22, 2018, 10:13 AM Romain
>>>>> >                         Manni-Bucau <rmannibucau@gmail.com
>>>>> >                         <ma...@gmail.com>> wrote:
>>>>> >
>>>>> >                             Hi guys,
>>>>> >
>>>>> >                             Checked out what has been done on schema
>>>>> >                             model and think it is acceptable -
>>>>> regarding
>>>>> >                             the json debate -
>>>>> >                             if
>>>>> https://issues.apache.org/jira/browse/BEAM-4381
>>>>> >                             can be fixed.
>>>>> >
>>>>> >                             High level, it is about providing a
>>>>> >                             mainstream and not too impacting model
>>>>> OOTB
>>>>> >                             and JSON seems the most valid option for
>>>>> >                             now, at least for IO and some user
>>>>> transforms.
>>>>> >
>>>>> >                             Wdyt?
>>>>> >
>>>>> >                             Le ven. 27 avr. 2018 18:36, Romain
>>>>> >                             Manni-Bucau <rmannibucau@gmail.com
>>>>> >                             <ma...@gmail.com>> a
>>>>> écrit :
>>>>> >
>>>>> >                                  Can give it a try end of may, sure.
>>>>> >                                 (holidays and work constraints will
>>>>> make
>>>>> >                                 it hard before).
>>>>> >
>>>>> >                                 Le 27 avr. 2018 18:26, "Anton Kedin"
>>>>> >                                 <kedin@google.com
>>>>> >                                 <ma...@google.com>> a écrit :
>>>>> >
>>>>> >                                     Romain,
>>>>> >
>>>>> >                                     I don't believe that JSON
>>>>> approach
>>>>> >                                     was investigated very
>>>>> thoroughIy. I
>>>>> >                                     mentioned few reasons which will
>>>>> >                                     make it not the best choice my
>>>>> >                                     opinion, but I may be wrong. Can
>>>>> you
>>>>> >                                     put together a design doc or a
>>>>> >                                     prototype?
>>>>> >
>>>>> >                                     Thank you,
>>>>> >                                     Anton
>>>>> >
>>>>> >
>>>>> >                                     On Thu, Apr 26, 2018 at 10:17 PM
>>>>> >                                     Romain Manni-Bucau
>>>>> >                                     <rmannibucau@gmail.com
>>>>> >                                     <ma...@gmail.com>>
>>>>> wrote:
>>>>> >
>>>>> >
>>>>> >
>>>>> >                                         Le 26 avr. 2018 23:13, "Anton
>>>>> >                                         Kedin" <kedin@google.com
>>>>> >                                         <ma...@google.com>>
>>>>> a écrit :
>>>>> >
>>>>> >                                             BeamRecord (Row) has very
>>>>> >                                             little in common with
>>>>> >                                             JsonObject (I assume
>>>>> you're
>>>>> >                                             talking about
>>>>> javax.json),
>>>>> >                                             except maybe some
>>>>> >                                             similarities of the API.
>>>>> Few
>>>>> >                                             reasons why JsonObject
>>>>> >                                             doesn't work:
>>>>> >
>>>>> >                                               * it is a Java EE API:
>>>>> >                                                   o Beam SDK is not
>>>>> >                                                     limited to Java.
>>>>> >                                                     There are
>>>>> probably
>>>>> >                                                     similar APIs for
>>>>> >                                                     other languages
>>>>> but
>>>>> >                                                     they might not
>>>>> >                                                     necessarily carry
>>>>> >                                                     the same
>>>>> semantics /
>>>>> >                                                     APIs;
>>>>> >
>>>>> >
>>>>> >                                         Not a big deal I think. At
>>>>> least
>>>>> >                                         not a technical blocker.
>>>>> >
>>>>> >                                                   o It can change
>>>>> >                                                     between Java
>>>>> versions;
>>>>> >
>>>>> >                                         No, this is javaee ;).
>>>>> >
>>>>> >
>>>>> >                                                   o Current Beam java
>>>>> >                                                     implementation
>>>>> is an
>>>>> >                                                     experimental
>>>>> feature
>>>>> >                                                     to identify
>>>>> what's
>>>>> >                                                     needed from such
>>>>> >                                                     API, in the end
>>>>> we
>>>>> >                                                     might end up with
>>>>> >                                                     something
>>>>> similar to
>>>>> >                                                     JsonObject API,
>>>>> but
>>>>> >                                                     likely not
>>>>> >
>>>>> >
>>>>> >                                         I dont get that point as a
>>>>> blocker
>>>>> >
>>>>> >                                                   o ;
>>>>> >                                               * represents JSON,
>>>>> which
>>>>> >                                                 is not an API but an
>>>>> >                                                 object notation:
>>>>> >                                                   o it is defined as
>>>>> >                                                     unicode string
>>>>> in a
>>>>> >                                                     certain format.
>>>>> If
>>>>> >                                                     you choose to
>>>>> adhere
>>>>> >                                                     to ECMA-404,
>>>>> then it
>>>>> >                                                     doesn't sound
>>>>> like
>>>>> >                                                     JsonObject can
>>>>> >                                                     represent an Avro
>>>>> >                                                     object, if I'm
>>>>> >                                                     reading it right;
>>>>> >
>>>>> >
>>>>> >                                         It is in the generator impl,
>>>>> you
>>>>> >                                         can impl an avrogenerator.
>>>>> >
>>>>> >                                               * doesn't define a type
>>>>> >                                                 system (JSON does,
>>>>> but
>>>>> >                                                 it's lacking):
>>>>> >                                                   o for example, JSON
>>>>> >                                                     doesn't define
>>>>> >                                                     semantics for
>>>>> numbers;
>>>>> >                                                   o doesn't define
>>>>> >                                                     date/time types;
>>>>> >                                                   o doesn't allow
>>>>> >                                                     extending JSON
>>>>> type
>>>>> >                                                     system at all;
>>>>> >
>>>>> >
>>>>> >                                         That is why you need a metada
>>>>> >                                         object, or simpler, a schema
>>>>> >                                         with that data. Json or beam
>>>>> >                                         record doesnt help here and
>>>>> you
>>>>> >                                         end up on the same outcome if
>>>>> >                                         you think about it.
>>>>> >
>>>>> >                                               * lacks schemas;
>>>>> >
>>>>> >                                         Jsonschema are standard,
>>>>> widely
>>>>> >                                         spread and tooled compared to
>>>>> >                                         alternative.
>>>>> >
>>>>> >                                             You can definitely try
>>>>> >                                             loosen the requirements
>>>>> and
>>>>> >                                             define everything in
>>>>> JSON in
>>>>> >                                             userland, but the point
>>>>> of
>>>>> >                                             Row/Schema is to avoid it
>>>>> >                                             and define everything in
>>>>> >                                             Beam model, which can be
>>>>> >                                             extended, mapped to JSON,
>>>>> >                                             Avro, BigQuery Schemas,
>>>>> >                                             custom binary format
>>>>> etc.,
>>>>> >                                             with same semantics
>>>>> across
>>>>> >                                             beam SDKs.
>>>>> >
>>>>> >
>>>>> >                                         This is what jsonp would
>>>>> allow
>>>>> >                                         with the benefit of a natural
>>>>> >                                         pojo support through jsonb.
>>>>> >
>>>>> >
>>>>> >
>>>>> >                                             On Thu, Apr 26, 2018 at
>>>>> >                                             12:28 PM Romain
>>>>> Manni-Bucau
>>>>> >                                             <rmannibucau@gmail.com
>>>>> >                                             <mailto:
>>>>> rmannibucau@gmail.com>>
>>>>> >                                             wrote:
>>>>> >
>>>>> >                                                 Just to let it be
>>>>> clear
>>>>> >                                                 and let me
>>>>> understand:
>>>>> >                                                 how is BeamRecord
>>>>> >                                                 different from a
>>>>> >                                                 JsonObject which is
>>>>> an
>>>>> >                                                 API without
>>>>> >                                                 implementation (not
>>>>> >                                                 event a json one
>>>>> OOTB)?
>>>>> >                                                 Advantage of json
>>>>> *api*
>>>>> >                                                 are indeed natural
>>>>> >                                                 mapping (jsonb is
>>>>> based
>>>>> >                                                 on jsonp so no new
>>>>> >                                                 binding to reinvent)
>>>>> and
>>>>> >                                                 simple serialization
>>>>> >                                                 (json+gzip for ex, or
>>>>> >                                                 avro if you want to
>>>>> be
>>>>> >                                                 geeky).
>>>>> >
>>>>> >                                                 I fail to see the
>>>>> point
>>>>> >                                                 to rebuild an
>>>>> ecosystem ATM.
>>>>> >
>>>>> >                                                 Le 26 avr. 2018
>>>>> 19:12,
>>>>> >                                                 "Reuven Lax"
>>>>> >                                                 <relax@google.com
>>>>> >                                                 <mailto:
>>>>> relax@google.com>>
>>>>> >                                                 a écrit :
>>>>> >
>>>>> >                                                     Exactly what JB
>>>>> >                                                     said. We will
>>>>> write
>>>>> >                                                     a generic
>>>>> conversion
>>>>> >                                                     from Avro (or
>>>>> json)
>>>>> >                                                     to Beam schemas,
>>>>> >                                                     which will make
>>>>> them
>>>>> >                                                     work
>>>>> transparently
>>>>> >                                                     with SQL. The
>>>>> plan
>>>>> >                                                     is also to
>>>>> migrate
>>>>> >                                                     Anton's work so
>>>>> that
>>>>> >                                                     POJOs works
>>>>> >                                                     generically for
>>>>> any
>>>>> >                                                     schema.
>>>>> >
>>>>> >                                                     Reuven
>>>>> >
>>>>> >                                                     On Thu, Apr 26,
>>>>> 2018
>>>>> >                                                     at 1:17 AM
>>>>> >                                                     Jean-Baptiste
>>>>> Onofré
>>>>> >                                                     <jb@nanthrax.net
>>>>> >                                                     <mailto:
>>>>> jb@nanthrax.net>>
>>>>> >                                                     wrote:
>>>>> >
>>>>> >                                                         For now we have a generic schema interface.
>>>>> >                                                         Json-b can be an impl, avro could be another one.
>>>>> >
>>>>> >                                                         Regards
>>>>> >                                                         JB
>>>>> >                                                         Le 26 avr. 2018, à 12:08, Romain Manni-Bucau
>>>>> >                                                         <rmannibucau@gmail.com> a écrit:
>>>>> >
>>>>> >                                                             Hmm,
>>>>> >
>>>>> >                                                             avro has still the pitfalls to have an uncontrolled
>>>>> >                                                             stack which brings way too much dependencies to be
>>>>> >                                                             part of any API, this is why I proposed a JSON-P
>>>>> >                                                             based API (JsonObject) with a custom beam entry for
>>>>> >                                                             some metadata (headers "à la Camel").
>>>>> >
>>>>> >                                                             Romain Manni-Bucau
>>>>> >                                                             @rmannibucau <https://twitter.com/rmannibucau> | Blog
>>>>> >                                                             <https://rmannibucau.metawerx.net/> | Old Blog
>>>>> >                                                             <http://rmannibucau.wordpress.com> | Github
>>>>> >                                                             <https://github.com/rmannibucau> | LinkedIn
>>>>> >                                                             <https://www.linkedin.com/in/rmannibucau> | Book
>>>>> >                                                             <https://www.packtpub.com/application-development/java-ee-8-high-performance>
>>>>> >
>>>>> >                                                             2018-04-26 9:59 GMT+02:00 Jean-Baptiste Onofré
>>>>> >                                                             <jb@nanthrax.net>:
>>>>> >
>>>>> >                                                                 Hi Ismael
>>>>> >
>>>>> >                                                                 You mean directly in Beam SQL ?
>>>>> >
>>>>> >                                                                 That will be part of schema support: generic
>>>>> >                                                                 record could be one of the payload with across
>>>>> >                                                                 schema.
>>>>> >
>>>>> >                                                                 Regards
>>>>> >                                                                 JB
>>>>> >                                                                 Le 26 avr. 2018, à 11:39, "Ismaël Mejía"
>>>>> >                                                                 <iemejia@gmail.com> a écrit:
>>>>> >
>>>>> >                                                                     Hello Anton,
>>>>> >
>>>>> >                                                                     Thanks for the descriptive email and the really
>>>>> >                                                                     useful work. Any plans to tackle PCollections of
>>>>> >                                                                     GenericRecord/IndexedRecords? it seems Avro is a
>>>>> >                                                                     natural fit for this approach too.
>>>>> >
>>>>> >                                                                     Regards,
>>>>> >                                                                     Ismaël
>>>>> >
>>>>> >                                                                     On Wed, Apr 25, 2018 at 9:04 PM, Anton Kedin
>>>>> >                                                                     <kedin@google.com> wrote:
>>>>> >
>>>>> >                                                                         Hi,
>>>>> >
>>>>> >                                                                         I want to highlight a couple of improvements to
>>>>> >                                                                         Beam SQL we have been working on recently which
>>>>> >                                                                         are targeted to make Beam SQL API easier to use.
>>>>> >                                                                         Specifically these features simplify conversion
>>>>> >                                                                         of Java Beans and JSON strings to Rows.
>>>>> >
>>>>> >                                                                         Feel free to try this and send any
>>>>> >                                                                         bugs/comments/PRs my way.
>>>>> >
>>>>> >                                                                         **Caveat: this is still work in progress, and has
>>>>> >                                                                         known bugs and incomplete features, see below for
>>>>> >                                                                         details.**
>>>>> >
>>>>> >                                                                         Background
>>>>> >
>>>>> >                                                                         Beam SQL queries can only be applied to
>>>>> >                                                                         PCollection<Row>. This means that users need to
>>>>> >                                                                         convert whatever PCollection elements they have
>>>>> >                                                                         to Rows before querying them with SQL. This
>>>>> >                                                                         usually requires
>>>>> >
>>>>
>>>>

Re: Beam SQL Improvements

Posted by Reuven Lax <re...@google.com>.
Just an update: Romain and I chatted on Slack, and I think I understand his
concern. The concern wasn't specifically about schemas, rather about having
a generic way to register per-ParDo state that has worker lifetime. As
evidence that such is needed, in many cases static variables are used to
> simulate that. Static variables, however, have downsides - if two pipelines
are run on the same JVM (happens often with unit tests, and there's nothing
that prevents a runner from doing so in a production environment), these
static variables will interfere with each other.

On Thu, May 24, 2018 at 12:30 AM Reuven Lax <re...@google.com> wrote:

> Romain, maybe it would be useful for us to find some time on slack. I'd
> like to understand your concerns. Also keep in mind that I'm tagging all
> these classes as Experimental for now, so we can definitely change these
> interfaces around if we decide they are not the best ones.
>
> Reuven
>
> On Tue, May 22, 2018 at 11:35 PM Romain Manni-Bucau <rm...@gmail.com>
> wrote:
>
>> Why not extending ProcessContext to add the new remapped output? But
>> looks good (the part i dont like is that creating a new context each time a
>> new feature is added is hurting users. What when beam will add some
>> reactive support? ReactiveOutputReceiver?)
>>
>> Pipeline sounds the wrong storage since once distributed you serialized
>> the instances so kind of broke the lifecycle of the original instance and
>> have no real release/close hook on them anymore right? Not sure we can do
>> better than dofn/source embedded instances today.
>>
>>
>>
>>
>> Le mer. 23 mai 2018 08:02, Romain Manni-Bucau <rm...@gmail.com> a
>> écrit :
>>
>>>
>>>
>>> Le mer. 23 mai 2018 07:55, Jean-Baptiste Onofré <jb...@nanthrax.net> a
>>> écrit :
>>>
>>>> Hi,
>>>>
>>>> IMHO, it would be better to have a explicit transform/IO as converter.
>>>>
>>>> It would be easier for users.
>>>>
>>>> Another option would be to use a "TypeConverter/SchemaConverter" map as
>>>> we do in Camel: Beam could check the source/destination "type" and check
>>>> in the map if there's a converter available. This map can be store as
>>>> part of the pipeline (as we do for filesystem registration).
>>>>
>>>
>>>
>>> It works in camel because it is not strongly typed, isnt it? So can
>>> require a beam new pipeline api.
>>>
>>> +1 for the explicit transform, if added to the pipeline api as coder it
>>> wouldnt break the fluent api:
>>>
>>> p.apply(io).setOutputType(Foo.class)
>>>
>>> Coders can be a workaround since they owns the type but since the
>>> pcollection is the real owner it is surely saner this way, no?
>>>
>>> Also it needs to ensure all converters are present before running the
>>> pipeline probably, no implicit environment converter support is probably
>>> good to start to avoid late surprises.
>>>
>>>
>>>
>>>> My $0.01
>>>>
>>>> Regards
>>>> JB
>>>>
>>>> On 23/05/2018 07:51, Romain Manni-Bucau wrote:
>>>> > How does it work on the pipeline side?
>>>> > Do you generate these "virtual" IO at build time to enable the fluent
>>>> > API to work not erasing generics?
>>>> >
>>>> > ex: SQL(row)->BigQuery(native) will not compile so we need a
>>>> > SQL(row)->BigQuery(row)
>>>> >
>>>> > Side note unrelated to Row: if you add another registry maybe a
>>>> pretask
>>>> > is to ensure beam has a kind of singleton/context to avoid to
>>>> duplicate
>>>> > it or not track it properly. These kind of converters will need a
>>>> global
>>>> > close and not only per record in general:
>>>> > converter.init();converter.convert(row);....converter.destroy();,
>>>> > otherwise it easily leaks. This is why it can require some way to not
>>>> > recreate it. A quick fix, if you are in bytebuddy already, can be to
>>>> add
>>>> > it to setup/teardown pby, being more global would be nicer but is more
>>>> > challenging.
>>>> >
>>>> > Romain Manni-Bucau
>>>> > @rmannibucau <https://twitter.com/rmannibucau> |  Blog
>>>> > <https://rmannibucau.metawerx.net/> | Old Blog
>>>> > <http://rmannibucau.wordpress.com> | Github
>>>> > <https://github.com/rmannibucau> | LinkedIn
>>>> > <https://www.linkedin.com/in/rmannibucau> | Book
>>>> > <
>>>> https://www.packtpub.com/application-development/java-ee-8-high-performance
>>>> >
>>>> >
>>>> >
>>>> > Le mer. 23 mai 2018 à 07:22, Reuven Lax <relax@google.com
>>>> > <ma...@google.com>> a écrit :
>>>> >
>>>> >     No - the only modules we need to add to core are the ones we
>>>> choose
>>>> >     to add. For example, I will probably add a registration for
>>>> >     TableRow/TableSchema (GCP BigQuery) so these can work seamlessly
>>>> >     with schemas. However I will add that to the GCP module, so only
>>>> >     someone depending on that module need to pull in that dependency.
>>>> >     The Java ServiceLoader framework can be used by these modules to
>>>> >     register schemas for their types (we already do something similar
>>>> >     for FileSystem and for coders as well).
>>>> >
>>>> >     BTW, right now the conversion back and forth between Row objects
>>>> I'm
>>>> >     doing in the ByteBuddy generated bytecode that we generate in
>>>> order
>>>> >     to invoke DoFns.
>>>> >
>>>> >     Reuven
>>>> >
>>>> >     On Tue, May 22, 2018 at 10:04 PM Romain Manni-Bucau
>>>> >     <rmannibucau@gmail.com <ma...@gmail.com>> wrote:
>>>> >
>>>> >         Hmm, the pluggability part is close to what I wanted to do
>>>> with
>>>> >         JsonObject as a main API (to avoid to redo a "row" API and
>>>> >         schema API)
>>>> >         Row.as(Class<T>) sounds good but then, does it mean we'll get
>>>> >         beam-sdk-java-row-jsonobject like modules (I'm not against,
>>>> just
>>>> >         trying to understand here)?
>>>> >         If so, how an IO can use as() with the type it expects? Doesnt
>>>> >         it lead to have a tons of  these modules at the end?
>>>> >
>>>> >         Romain Manni-Bucau
>>>> >         @rmannibucau <https://twitter.com/rmannibucau> |  Blog
>>>> >         <https://rmannibucau.metawerx.net/> | Old Blog
>>>> >         <http://rmannibucau.wordpress.com> | Github
>>>> >         <https://github.com/rmannibucau> | LinkedIn
>>>> >         <https://www.linkedin.com/in/rmannibucau> | Book
>>>> >         <
>>>> https://www.packtpub.com/application-development/java-ee-8-high-performance
>>>> >
>>>> >
>>>> >
>>>> >         Le mer. 23 mai 2018 à 04:57, Reuven Lax <relax@google.com
>>>> >         <ma...@google.com>> a écrit :
>>>> >
>>>> >             By the way Romain, if you have specific scenarios in mind
>>>> I
>>>> >             would love to hear them. I can try and guess what exactly
>>>> >             you would like to get out of schemas, but it would work
>>>> >             better if you gave me concrete scenarios that you would
>>>> like
>>>> >             to work.
>>>> >
>>>> >             Reuven
>>>> >
>>>> >             On Tue, May 22, 2018 at 7:45 PM Reuven Lax <
>>>> relax@google.com
>>>> >             <ma...@google.com>> wrote:
>>>> >
>>>> >                 Yeah, what I'm working on will help with IO. Basically
>>>> >                 if you register a function with SchemaRegistry that
>>>> >                 converts back and forth between a type (say
>>>> JsonObject)
>>>> >                 and a Beam Row, then it is applied by the framework
>>>> >                 behind the scenes as part of DoFn invocation. Concrete
>>>> >                 example: let's say I have an IO that reads json
>>>> objects
>>>> >                   class MyJsonIORead extends PTransform<PBegin,
>>>> >                 JsonObject> {...}
>>>> >
>>>> >                 If you register a schema for this type (or you can
>>>> also
>>>> >                 just set the schema directly on the output
>>>> PCollection),
>>>> >                 then Beam knows how to convert back and forth between
>>>> >                 JsonObject and Row. So the next ParDo can look like
>>>> >
>>>> >                 p.apply(new MyJsonIORead())
>>>> >                 .apply(ParDo.of(new DoFn<JsonObject, T>....
>>>> >                     @ProcessElement void process(@Element Row row) {
>>>> >                    })
>>>> >
>>>> >                 And Beam will automatically convert JsonObject to a
>>>> Row
>>>> >                 for processing (you aren't forced to do this of
>>>> course -
>>>> >                 you can always ask for it as a JsonObject).
>>>> >
>>>> >                 The same is true for output. If you have a sink that
>>>> >                 takes in JsonObject but the transform before it
>>>> produces
>>>> >                 Row objects (for instance - because the transform
>>>> before
>>>> >                 it is Beam SQL), Beam can automatically convert Row
>>>> back
>>>> >                 to JsonObject for you.
>>>> >
>>>> >                 All of this was detailed in the Schema doc I shared a
>>>> >                 few months ago. There was a lot of discussion on that
>>>> >                 document from various parties, and some of this API
>>>> is a
>>>> >                 result of that discussion. This is also working in the
>>>> >                 branch JB and I were working on, though not yet
>>>> >                 integrated back to master.
>>>> >
>>>> >                 I would like to actually go further and make Row an
>>>> >                 interface and provide a way to automatically put a Row
>>>> >                 interface on top of any other object (e.g. JsonObject,
>>>> >                 Pojo, etc.) This won't change the way the user writes
>>>> >                 code, but instead of Beam having to copy and convert
>>>> at
>>>> >                 each stage (e.g. from JsonObject to Row) it simply
>>>> will
>>>> >                 create a Row object that uses the JsonObject as
>>>> its
>>>> >                 underlying storage.
>>>> >
>>>> >                 Reuven
>>>> >
>>>> >                 On Tue, May 22, 2018 at 11:37 AM Romain Manni-Bucau
>>>> >                 <rmannibucau@gmail.com <mailto:rmannibucau@gmail.com
>>>> >>
>>>> >                 wrote:
>>>> >
>>>> >                     Well, beam can implement a new mapper but it
>>>> doesnt
>>>> >                     help for io. Most of modern backends will take
>>>> json
>>>> >                     directly, even javax one and it must stay generic.
>>>> >
>>>> >                     Then since json to pojo mapping is already done a
>>>> >                     dozen of times, not sure it is worth it for now.
>>>> >
>>>> >                     Le mar. 22 mai 2018 20:27, Reuven Lax
>>>> >                     <relax@google.com <ma...@google.com>> a
>>>> écrit :
>>>> >
>>>> >                         We can do even better btw. Building a
>>>> >                         SchemaRegistry where automatic conversions can
>>>> >                         be registered between schema and Java data
>>>> >                         types. With this the user won't even need a
>>>> DoFn
>>>> >                         to do the conversion.
>>>> >
>>>> >                         On Tue, May 22, 2018, 10:13 AM Romain
>>>> >                         Manni-Bucau <rmannibucau@gmail.com
>>>> >                         <ma...@gmail.com>> wrote:
>>>> >
>>>> >                             Hi guys,
>>>> >
>>>> >                             Checked out what has been done on schema
>>>> >                             model and think it is acceptable -
>>>> regarding
>>>> >                             the json debate -
>>>> >                             if
>>>> https://issues.apache.org/jira/browse/BEAM-4381
>>>> >                             can be fixed.
>>>> >
>>>> >                             High level, it is about providing a
>>>> >                             mainstream and not too impacting model
>>>> OOTB
>>>> >                             and JSON seems the most valid option for
>>>> >                             now, at least for IO and some user
>>>> transforms.
>>>> >
>>>> >                             Wdyt?
>>>> >
>>>> >                             Le ven. 27 avr. 2018 18:36, Romain
>>>> >                             Manni-Bucau <rmannibucau@gmail.com
>>>> >                             <ma...@gmail.com>> a écrit :
>>>> >
>>>> >                                  Can give it a try end of may, sure.
>>>> >                                 (holidays and work constraints will
>>>> make
>>>> >                                 it hard before).
>>>> >
>>>> >                                 Le 27 avr. 2018 18:26, "Anton Kedin"
>>>> >                                 <kedin@google.com
>>>> >                                 <ma...@google.com>> a écrit :
>>>> >
>>>> >                                     Romain,
>>>> >
>>>> >                                     I don't believe that JSON approach
>>>> >                                     was investigated very thoroughIy.
>>>> I
>>>> >                                     mentioned few reasons which will
>>>> >                                     make it not the best choice my
>>>> >                                     opinion, but I may be wrong. Can
>>>> you
>>>> >                                     put together a design doc or a
>>>> >                                     prototype?
>>>> >
>>>> >                                     Thank you,
>>>> >                                     Anton
>>>> >
>>>> >
>>>> >                                     On Thu, Apr 26, 2018 at 10:17 PM
>>>> >                                     Romain Manni-Bucau
>>>> >                                     <rmannibucau@gmail.com
>>>> >                                     <ma...@gmail.com>>
>>>> wrote:
>>>> >
>>>> >
>>>> >
>>>> >                                         Le 26 avr. 2018 23:13, "Anton
>>>> >                                         Kedin" <kedin@google.com
>>>> >                                         <ma...@google.com>> a
>>>> écrit :
>>>> >
>>>> >                                             BeamRecord (Row) has very
>>>> >                                             little in common with
>>>> >                                             JsonObject (I assume
>>>> you're
>>>> >                                             talking about javax.json),
>>>> >                                             except maybe some
>>>> >                                             similarities of the API.
>>>> Few
>>>> >                                             reasons why JsonObject
>>>> >                                             doesn't work:
>>>> >
>>>> >                                               * it is a Java EE API:
>>>> >                                                   o Beam SDK is not
>>>> >                                                     limited to Java.
>>>> >                                                     There are probably
>>>> >                                                     similar APIs for
>>>> >                                                     other languages
>>>> but
>>>> >                                                     they might not
>>>> >                                                     necessarily carry
>>>> >                                                     the same
>>>> semantics /
>>>> >                                                     APIs;
>>>> >
>>>> >
>>>> >                                         Not a big deal I think. At
>>>> least
>>>> >                                         not a technical blocker.
>>>> >
>>>> >                                                   o It can change
>>>> >                                                     between Java
>>>> versions;
>>>> >
>>>> >                                         No, this is javaee ;).
>>>> >
>>>> >
>>>> >                                                   o Current Beam java
>>>> >                                                     implementation is
>>>> an
>>>> >                                                     experimental
>>>> feature
>>>> >                                                     to identify what's
>>>> >                                                     needed from such
>>>> >                                                     API, in the end we
>>>> >                                                     might end up with
>>>> >                                                     something similar
>>>> to
>>>> >                                                     JsonObject API,
>>>> but
>>>> >                                                     likely not
>>>> >
>>>> >
>>>> >                                         I dont get that point as a
>>>> blocker
>>>> >
>>>> >                                                   o ;
>>>> >                                               * represents JSON, which
>>>> >                                                 is not an API but an
>>>> >                                                 object notation:
>>>> >                                                   o it is defined as
>>>> >                                                     unicode string in
>>>> a
>>>> >                                                     certain format. If
>>>> >                                                     you choose to
>>>> adhere
>>>> >                                                     to ECMA-404, then
>>>> it
>>>> >                                                     doesn't sound like
>>>> >                                                     JsonObject can
>>>> >                                                     represent an Avro
>>>> >                                                     object, if I'm
>>>> >                                                     reading it right;
>>>> >
>>>> >
>>>> >                                         It is in the generator impl,
>>>> you
>>>> >                                         can impl an avrogenerator.
>>>> >
>>>> >                                               * doesn't define a type
>>>> >                                                 system (JSON does, but
>>>> >                                                 it's lacking):
>>>> >                                                   o for example, JSON
>>>> >                                                     doesn't define
>>>> >                                                     semantics for
>>>> numbers;
>>>> >                                                   o doesn't define
>>>> >                                                     date/time types;
>>>> >                                                   o doesn't allow
>>>> >                                                     extending JSON
>>>> type
>>>> >                                                     system at all;
>>>> >
>>>> >
>>>> >                                         That is why you need a metada
>>>> >                                         object, or simpler, a schema
>>>> >                                         with that data. Json or beam
>>>> >                                         record doesnt help here and you end up on the same outcome if you think about it.
>>>> >
>>>> >       * lacks schemas;
>>>> >
>>>> > Jsonschema are standard, widely spread and tooled compared to alternative.
>>>> >
>>>> >     You can definitely try loosen the requirements and define everything in
>>>> >     JSON in userland, but the point of Row/Schema is to avoid it and define
>>>> >     everything in Beam model, which can be extended, mapped to JSON, Avro,
>>>> >     BigQuery Schemas, custom binary format etc., with same semantics across
>>>> >     beam SDKs.
>>>> >
>>>> > This is what jsonp would allow with the benefit of a natural pojo support
>>>> > through jsonb.
>>>> >
>>>> >     On Thu, Apr 26, 2018 at 12:28 PM Romain Manni-Bucau <rmannibucau@gmail.com> wrote:
>>>> >
>>>> >         Just to let it be clear and let me understand: how is BeamRecord
>>>> >         different from a JsonObject which is an API without implementation
>>>> >         (not event a json one OOTB)? Advantage of json *api* are indeed
>>>> >         natural mapping (jsonb is based on jsonp so no new binding to
>>>> >         reinvent) and simple serialization (json+gzip for ex, or avro if you
>>>> >         want to be geeky).
>>>> >
>>>> >         I fail to see the point to rebuild an ecosystem ATM.
>>>> >
>>>> >         Le 26 avr. 2018 19:12, "Reuven Lax" <relax@google.com> a écrit :
>>>> >
>>>> >             Exactly what JB said. We will write a generic conversion from Avro
>>>> >             (or json) to Beam schemas, which will make them work transparently
>>>> >             with SQL. The plan is also to migrate Anton's work so that POJOs
>>>> >             works generically for any schema.
>>>> >
>>>> >             Reuven
>>>> >
>>>> >             On Thu, Apr 26, 2018 at 1:17 AM Jean-Baptiste Onofré <jb@nanthrax.net> wrote:
>>>> >
>>>> >                 For now we have a generic schema interface. Json-b can be an
>>>> >                 impl, avro could be another one.
>>>> >
>>>> >                 Regards
>>>> >                 JB
>>>> >
>>>> >                 Le 26 avr. 2018, à 12:08, Romain Manni-Bucau <rmannibucau@gmail.com> a écrit:
>>>> >
>>>> >                     Hmm,
>>>> >
>>>> >                     avro has still the pitfalls to have an uncontrolled stack
>>>> >                     which brings way too much dependencies to be part of any
>>>> >                     API, this is why I proposed a JSON-P based API (JsonObject)
>>>> >                     with a custom beam entry for some metadata (headers "à la
>>>> >                     Camel").
>>>> >
>>>> >                     Romain Manni-Bucau
>>>> >                     @rmannibucau <https://twitter.com/rmannibucau> | Blog
>>>> >                     <https://rmannibucau.metawerx.net/> | Old Blog
>>>> >                     <http://rmannibucau.wordpress.com> | Github
>>>> >                     <https://github.com/rmannibucau> | LinkedIn
>>>> >                     <https://www.linkedin.com/in/rmannibucau> | Book
>>>> >                     <https://www.packtpub.com/application-development/java-ee-8-high-performance>
>>>> >
>>>> >                     2018-04-26 9:59 GMT+02:00 Jean-Baptiste Onofré <jb@nanthrax.net>:
>>>> >
>>>> >                         Hi Ismael
>>>> >
>>>> >                         You mean directly in Beam SQL ?
>>>> >
>>>> >                         That will be part of schema support: generic record
>>>> >                         could be one of the payload with across schema.
>>>> >
>>>> >                         Regards
>>>> >                         JB
>>>> >
>>>> >                         Le 26 avr. 2018, à 11:39, "Ismaël Mejía" <iemejia@gmail.com> a écrit:
>>>> >
>>>> >                             Hello Anton,
>>>> >
>>>> >                             Thanks for the descriptive email and the really
>>>> >                             useful work. Any plans to tackle PCollections of
>>>> >                             GenericRecord/IndexedRecords? it seems Avro is a
>>>> >                             natural fit for this approach too.
>>>> >
>>>> >                             Regards,
>>>> >                             Ismaël
>>>> >
>>>> >                             On Wed, Apr 25, 2018 at 9:04 PM, Anton Kedin <kedin@google.com> wrote:
>>>> >
>>>> >                                 Hi,
>>>> >
>>>> >                                 I want to highlight a couple of improvements to
>>>> >                                 Beam SQL we have been working on recently which
>>>> >                                 are targeted to make Beam SQL API easier to
>>>> >                                 use. Specifically these features simplify
>>>> >                                 conversion of Java Beans and JSON strings to
>>>> >                                 Rows.
>>>> >
>>>> >                                 Feel free to try this and send any
>>>> >                                 bugs/comments/PRs my way.
>>>> >
>>>> >                                 **Caveat: this is still work in progress, and
>>>> >                                 has known bugs and incomplete features, see
>>>> >                                 below for details.**
>>>> >
>>>> >                                 Background
>>>> >
>>>> >                                 Beam SQL queries can only be applied to
>>>> >                                 PCollection<Row>. This means that users need to
>>>> >                                 convert whatever PCollection elements they have
>>>> >                                 to Rows before querying them with SQL. This
>>>> >                                 usually requires
>>>>
>>>
>>>

Re: Beam SQL Improvements

Posted by Kenneth Knowles <kl...@google.com>.
FWIW adding (and removing) things from the highly dynamic parameter list of
@ProcessElement is precisely what it is intended for.

Kenn
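
A minimal sketch of the flexible @ProcessElement signature Kenn is referring to, assuming a
schema is attached upstream so the element can be requested as a Row (the "userId" field and
the ExtractUserIdFn name are only illustrative):

    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.values.Row;

    class ExtractUserIdFn extends DoFn<Row, String> {
      @ProcessElement
      public void process(@Element Row row, OutputReceiver<String> out) {
        // Only the parameters declared in the signature get injected; adding or
        // removing one changes nothing else in the DoFn.
        out.output(row.getString("userId"));
      }
    }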

On Wed, May 23, 2018 at 2:31 PM Reuven Lax <re...@google.com> wrote:

> Romain, maybe it would be useful for us to find some time on slack. I'd
> like to understand your concerns. Also keep in mind that I'm tagging all
> these classes as Experimental for now, so we can definitely change these
> interfaces around if we decide they are not the best ones.
>
> Reuven
>
> On Tue, May 22, 2018 at 11:35 PM Romain Manni-Bucau <rm...@gmail.com>
> wrote:
>
>> Why not extend ProcessContext to add the new remapped output? Otherwise it
>> looks good (the part I don't like is that creating a new context each time a
>> new feature is added hurts users. What happens when Beam adds some
>> reactive support? A ReactiveOutputReceiver?)
>>
>> Pipeline sounds like the wrong storage since, once distributed, you have
>> serialized the instances, so you have kind of broken the lifecycle of the
>> original instance and no longer have a real release/close hook on them, right?
>> Not sure we can do better than DoFn/source embedded instances today.
>>
>>
>>
>>
>> Le mer. 23 mai 2018 08:02, Romain Manni-Bucau <rm...@gmail.com> a
>> écrit :
>>
>>>
>>>
>>> Le mer. 23 mai 2018 07:55, Jean-Baptiste Onofré <jb...@nanthrax.net> a
>>> écrit :
>>>
>>>> Hi,
>>>>
>>>> IMHO, it would be better to have an explicit transform/IO as a converter.
>>>>
>>>> It would be easier for users.
>>>>
>>>> Another option would be to use a "TypeConverter/SchemaConverter" map as
>>>> we do in Camel: Beam could check the source/destination "type" and check
>>>> in the map if there's a converter available. This map can be stored as
>>>> part of the pipeline (as we do for filesystem registration).
>>>>
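
A very rough sketch of what such a Camel-style converter map could look like (purely
illustrative: none of these names exist in Beam, only SerializableFunction does):

    import java.io.Serializable;
    import java.util.HashMap;
    import java.util.Map;
    import org.apache.beam.sdk.transforms.SerializableFunction;

    // Hypothetical registry keyed by "sourceType->destinationType", stored with the
    // pipeline the same way filesystems are registered.
    class ConverterRegistry implements Serializable {
      private final Map<String, SerializableFunction<?, ?>> converters = new HashMap<>();

      <A, B> void register(Class<A> from, Class<B> to, SerializableFunction<A, B> fn) {
        converters.put(from.getName() + "->" + to.getName(), fn);
      }

      @SuppressWarnings("unchecked")
      <A, B> SerializableFunction<A, B> lookup(Class<A> from, Class<B> to) {
        return (SerializableFunction<A, B>) converters.get(from.getName() + "->" + to.getName());
      }
    }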
>>>
>>>
>>> It works in Camel because it is not strongly typed, isn't it? So it can
>>> require a new Beam pipeline API.
>>>
>>> +1 for the explicit transform; if added to the pipeline API the way coders
>>> are, it wouldn't break the fluent API:
>>>
>>> p.apply(io).setOutputType(Foo.class)
>>>
>>> Coders could be a workaround since they own the type, but since the
>>> PCollection is the real owner it is surely saner this way, no?
>>>
>>> It also needs to ensure all converters are present before running the
>>> pipeline; no implicit environment converter support is probably good to
>>> start with, to avoid late surprises.
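
Read literally, the explicit-transform option could be nothing more than a visible mapping
step between the IO and the rest of the pipeline; a rough sketch (Foo and its constructor
are placeholders, and MyJsonIORead stands for any JsonObject-producing IO such as the one
discussed further down this thread):

    // Explicit, visible conversion instead of an implicit, environment-provided converter.
    PCollection<Foo> typed =
        p.apply(new MyJsonIORead())
         .apply(MapElements
             .into(TypeDescriptor.of(Foo.class))
             .via(json -> new Foo(json.getString("id"), json.getInt("count"))));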
>>>
>>>
>>>
>>>> My $0.01
>>>>
>>>> Regards
>>>> JB
>>>>
>>>> On 23/05/2018 07:51, Romain Manni-Bucau wrote:
>>>> > How does it work on the pipeline side?
>>>> > Do you generate these "virtual" IOs at build time to enable the fluent
>>>> > API to work without erasing generics?
>>>> >
>>>> > ex: SQL(row)->BigQuery(native) will not compile so we need a
>>>> > SQL(row)->BigQuery(row)
>>>> >
>>>> > Side note unrelated to Row: if you add another registry, maybe a pretask
>>>> > is to ensure Beam has a kind of singleton/context to avoid duplicating
>>>> > it or not tracking it properly. These kinds of converters will need a
>>>> > global close and not only a per-record one in general:
>>>> > converter.init();converter.convert(row);....converter.destroy();,
>>>> > otherwise it easily leaks. This is why it can require some way to not
>>>> > recreate it. A quick fix, if you are in ByteBuddy already, can be to add
>>>> > it to setup/teardown probably; being more global would be nicer but is
>>>> > more challenging.
>>>> >
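
The lifecycle concern above maps fairly directly onto the existing @Setup/@Teardown hooks;
a minimal sketch (RowToPojoConverter and MyPojo are placeholders, not Beam classes):

    // The converter is created once per DoFn instance rather than once per record,
    // and gets an explicit destroy hook instead of leaking.
    class ConvertingFn extends DoFn<Row, MyPojo> {
      private transient RowToPojoConverter converter;

      @Setup
      public void setup() {
        converter = new RowToPojoConverter();
        converter.init();
      }

      @ProcessElement
      public void process(@Element Row row, OutputReceiver<MyPojo> out) {
        out.output(converter.convert(row));
      }

      @Teardown
      public void teardown() {
        converter.destroy();
      }
    }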
>>>> > Romain Manni-Bucau
>>>> > @rmannibucau <https://twitter.com/rmannibucau> |  Blog
>>>> > <https://rmannibucau.metawerx.net/> | Old Blog
>>>> > <http://rmannibucau.wordpress.com> | Github
>>>> > <https://github.com/rmannibucau> | LinkedIn
>>>> > <https://www.linkedin.com/in/rmannibucau> | Book
>>>> > <
>>>> https://www.packtpub.com/application-development/java-ee-8-high-performance
>>>> >
>>>> >
>>>> >
>>>> > Le mer. 23 mai 2018 à 07:22, Reuven Lax <relax@google.com
>>>> > <ma...@google.com>> a écrit :
>>>> >
>>>> >     No - the only modules we need to add to core are the ones we
>>>> choose
>>>> >     to add. For example, I will probably add a registration for
>>>> >     TableRow/TableSchema (GCP BigQuery) so these can work seamlessly
>>>> >     with schemas. However I will add that to the GCP module, so only
>>>> >     someone depending on that module need to pull in that dependency.
>>>> >     The Java ServiceLoader framework can be used by these modules to
>>>> >     register schemas for their types (we already do something similar
>>>> >     for FileSystem and for coders as well).
>>>> >
>>>> >     BTW, right now I'm doing the conversion back and forth between Row
>>>> >     objects in the ByteBuddy-generated bytecode that we generate in order
>>>> >     to invoke DoFns.
>>>> >
>>>> >     Reuven
>>>> >
>>>> >     On Tue, May 22, 2018 at 10:04 PM Romain Manni-Bucau
>>>> >     <rmannibucau@gmail.com <ma...@gmail.com>> wrote:
>>>> >
>>>> >         Hmm, the pluggability part is close to what I wanted to do with
>>>> >         JsonObject as a main API (to avoid redoing a "row" API and a
>>>> >         schema API).
>>>> >         Row.as(Class<T>) sounds good, but then does it mean we'll get
>>>> >         beam-sdk-java-row-jsonobject-like modules (I'm not against, just
>>>> >         trying to understand here)?
>>>> >         If so, how can an IO use as() with the type it expects? Doesn't
>>>> >         it lead to having tons of these modules in the end?
>>>> >
>>>> >         Romain Manni-Bucau
>>>> >         @rmannibucau <https://twitter.com/rmannibucau> |  Blog
>>>> >         <https://rmannibucau.metawerx.net/> | Old Blog
>>>> >         <http://rmannibucau.wordpress.com> | Github
>>>> >         <https://github.com/rmannibucau> | LinkedIn
>>>> >         <https://www.linkedin.com/in/rmannibucau> | Book
>>>> >         <
>>>> https://www.packtpub.com/application-development/java-ee-8-high-performance
>>>> >
>>>> >
>>>> >
>>>> >         Le mer. 23 mai 2018 à 04:57, Reuven Lax <relax@google.com
>>>> >         <ma...@google.com>> a écrit :
>>>> >
>>>> >             By the way Romain, if you have specific scenarios in mind
>>>> I
>>>> >             would love to hear them. I can try and guess what exactly
>>>> >             you would like to get out of schemas, but it would work
>>>> >             better if you gave me concrete scenarios that you would
>>>> like
>>>> >             to work.
>>>> >
>>>> >             Reuven
>>>> >
>>>> >             On Tue, May 22, 2018 at 7:45 PM Reuven Lax <
>>>> relax@google.com
>>>> >             <ma...@google.com>> wrote:
>>>> >
>>>> >                 Yeah, what I'm working on will help with IO. Basically
>>>> >                 if you register a function with SchemaRegistry that
>>>> >                 converts back and forth between a type (say
>>>> JsonObject)
>>>> >                 and a Beam Row, then it is applied by the framework
>>>> >                 behind the scenes as part of DoFn invocation. Concrete
>>>> >                 example: let's say I have an IO that reads json
>>>> objects
>>>> >                   class MyJsonIORead extends PTransform<PBegin,
>>>> >                 PCollection<JsonObject>> {...}
>>>> >
>>>> >                 If you register a schema for this type (or you can
>>>> also
>>>> >                 just set the schema directly on the output
>>>> PCollection),
>>>> >                 then Beam knows how to convert back and forth between
>>>> >                 JsonObject and Row. So the next ParDo can look like
>>>> >
>>>> >                 p.apply(new MyJsonIORead())
>>>> >                 .apply(ParDo.of(new DoFn<JsonObject, T>....
>>>> >                     @ProcessElement void process(@Element Row row) {
>>>> >                    })
>>>> >
>>>> >                 And Beam will automatically convert JsonObject to a
>>>> Row
>>>> >                 for processing (you aren't forced to do this of
>>>> course -
>>>> >                 you can always ask for it as a JsonObject).
>>>> >
>>>> >                 The same is true for output. If you have a sink that
>>>> >                 takes in JsonObject but the transform before it
>>>> produces
>>>> >                 Row objects (for instance - because the transform
>>>> before
>>>> >                 it is Beam SQL), Beam can automatically convert Row
>>>> back
>>>> >                 to JsonObject for you.
>>>> >
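
For the sink direction this could look roughly as follows, assuming the Beam SQL transform
of that era (BeamSql.query) and a registered schema/conversion for JsonObject; this is a
sketch of the behaviour being described, not working code, and the query and field names
are made up:

    p.apply(new MyJsonIORead())                                   // emits JsonObject elements
     .apply(BeamSql.query(
         "SELECT userId, COUNT(*) AS cnt FROM PCOLLECTION GROUP BY userId"))  // emits Rows
     .apply(ParDo.of(new DoFn<Row, String>() {
        @ProcessElement
        public void process(@Element JsonObject element, OutputReceiver<String> out) {
          // The SQL output Row is handed over as a JsonObject because a conversion
          // was registered for that type; from here any JsonObject-based sink applies.
          out.output(element.toString());
        }
     }));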
>>>> >                 All of this was detailed in the Schema doc I shared a
>>>> >                 few months ago. There was a lot of discussion on that
>>>> >                 document from various parties, and some of this API
>>>> is a
>>>> >                 result of that discussion. This is also working in the
>>>> >                 branch JB and I were working on, though not yet
>>>> >                 integrated back to master.
>>>> >
>>>> >                 I would like to actually go further and make Row an
>>>> >                 interface and provide a way to automatically put a Row
>>>> >                 interface on top of any other object (e.g. JsonObject,
>>>> >                 Pojo, etc.) This won't change the way the user writes
>>>> >                 code, but instead of Beam having to copy and convert
>>>> at
>>>> >                 each stage (e.g. from JsonObject to Row) it simply
>>>> will
>>>> >                 create a Row object that uses the JsonObject as its
>>>> >                 underlying storage.
>>>> >
>>>> >                 Reuven
>>>> >
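
If Row did become an interface, the wrapper idea might look like this sketch (pure
pseudo-API: Row is currently a concrete class, so the RowView interface and its methods
are assumptions made up for illustration):

    // Hypothetical read-only Row view backed directly by a JsonObject, so no copy
    // is made at each stage boundary.
    class JsonBackedRow implements RowView {          // RowView stands in for a Row interface
      private final javax.json.JsonObject json;
      private final Schema schema;                    // the Beam schema describing the fields

      JsonBackedRow(javax.json.JsonObject json, Schema schema) {
        this.json = json;
        this.schema = schema;
      }

      @Override
      public String getString(String field) {
        return json.getString(field);                 // reads straight from the underlying JSON
      }

      @Override
      public Schema getSchema() {
        return schema;
      }
    }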
>>>> >                 On Tue, May 22, 2018 at 11:37 AM Romain Manni-Bucau
>>>> >                 <rmannibucau@gmail.com <mailto:rmannibucau@gmail.com
>>>> >>
>>>> >                 wrote:
>>>> >
>>>> >                     Well, Beam can implement a new mapper but it doesn't
>>>> >                     help for IO. Most modern backends will take json
>>>> >                     directly, even the javax one, and it must stay generic.
>>>> >
>>>> >                     Then, since json-to-pojo mapping has already been done a
>>>> >                     dozen times, not sure it is worth it for now.
>>>> >
>>>> >                     Le mar. 22 mai 2018 20:27, Reuven Lax
>>>> >                     <relax@google.com <ma...@google.com>> a
>>>> écrit :
>>>> >
>>>> >                         We can do even better btw. Building a
>>>> >                         SchemaRegistry where automatic conversions can
>>>> >                         be registered between schema and Java data
>>>> >                         types. With this the user won't even need a
>>>> DoFn
>>>> >                         to do the conversion.
>>>> >
>>>> >                         On Tue, May 22, 2018, 10:13 AM Romain
>>>> >                         Manni-Bucau <rmannibucau@gmail.com
>>>> >                         <ma...@gmail.com>> wrote:
>>>> >
>>>> >                             Hi guys,
>>>> >
>>>> >                             Checked out what has been done on schema
>>>> >                             model and think it is acceptable -
>>>> regarding
>>>> >                             the json debate -
>>>> >                             if
>>>> https://issues.apache.org/jira/browse/BEAM-4381
>>>> >                             can be fixed.
>>>> >
>>>> >                             High level, it is about providing a
>>>> >                             mainstream and not too impacting model
>>>> OOTB
>>>> >                             and JSON seems the most valid option for
>>>> >                             now, at least for IO and some user
>>>> transforms.
>>>> >
>>>> >                             Wdyt?
>>>> >
>>>> >                             Le ven. 27 avr. 2018 18:36, Romain
>>>> >                             Manni-Bucau <rmannibucau@gmail.com
>>>> >                             <ma...@gmail.com>> a écrit :
>>>> >
>>>> >                                  Can give it a try end of may, sure.
>>>> >                                 (holidays and work constraints will
>>>> make
>>>> >                                 it hard before).
>>>> >
>>>> >                                 Le 27 avr. 2018 18:26, "Anton Kedin"
>>>> >                                 <kedin@google.com
>>>> >                                 <ma...@google.com>> a écrit :
>>>> >
>>>> >                                     Romain,
>>>> >
>>>> >                                     I don't believe that JSON approach
>>>> >                                     was investigated very thoroughIy.
>>>> I
>>>> >                                     mentioned few reasons which will
>>>> >                                     make it not the best choice my
>>>> >                                     opinion, but I may be wrong. Can
>>>> you
>>>> >                                     put together a design doc or a
>>>> >                                     prototype?
>>>> >
>>>> >                                     Thank you,
>>>> >                                     Anton
>>>> >
>>>> >
>>>> >                                     On Thu, Apr 26, 2018 at 10:17 PM
>>>> >                                     Romain Manni-Bucau
>>>> >                                     <rmannibucau@gmail.com
>>>> >                                     <ma...@gmail.com>>
>>>> wrote:
>>>> >
>>>> >
>>>> >
>>>> > Le 26 avr. 2018 23:13, "Anton Kedin" <kedin@google.com> a écrit :
>>>> >
>>>> >     BeamRecord (Row) has very little in common with JsonObject (I assume
>>>> >     you're talking about javax.json), except maybe some similarities of the
>>>> >     API. Few reasons why JsonObject doesn't work:
>>>> >
>>>> >       * it is a Java EE API:
>>>> >           o Beam SDK is not limited to Java. There are probably similar APIs
>>>> >             for other languages but they might not necessarily carry the same
>>>> >             semantics / APIs;
>>>> >
>>>> > Not a big deal I think. At least not a technical blocker.
>>>> >
>>>> >           o It can change between Java versions;
>>>> >
>>>> > No, this is javaee ;).
>>>> >
>>>> >           o Current Beam java implementation is an experimental feature to
>>>> >             identify what's needed from such API, in the end we might end up
>>>> >             with something similar to JsonObject API, but likely not;
>>>> >
>>>> > I dont get that point as a blocker
>>>> >
>>>> >       * represents JSON, which is not an API but an object notation:
>>>> >           o it is defined as unicode string in a certain format. If you choose
>>>> >             to adhere to ECMA-404, then it doesn't sound like JsonObject can
>>>> >             represent an Avro object, if I'm reading it right;
>>>> >
>>>> > It is in the generator impl, you can impl an avrogenerator.
>>>> >
>>>> >       * doesn't define a type system (JSON does, but it's lacking):
>>>> >           o for example, JSON doesn't define semantics for numbers;
>>>> >           o doesn't define date/time types;
>>>> >           o doesn't allow extending JSON type system at all;
>>>> >
>>>> > That is why you need a metada object, or simpler, a schema with that data.
>>>> > Json or beam record doesnt help here and you end up on the same outcome if
>>>> > you think about it.
>>>> >
>>>> >       * lacks schemas;
>>>> >
>>>> > Jsonschema are standard, widely spread and tooled compared to alternative.
>>>> >
>>>> >     You can definitely try loosen the requirements and define everything in
>>>> >     JSON in userland, but the point of Row/Schema is to avoid it and define
>>>> >     everything in Beam model, which can be extended, mapped to JSON, Avro,
>>>> >     BigQuery Schemas, custom binary format etc., with same semantics across
>>>> >     beam SDKs.
>>>> >
>>>> > This is what jsonp would allow with the benefit of a natural pojo support
>>>> > through jsonb.
>>>> >
>>>> >     On Thu, Apr 26, 2018 at 12:28 PM Romain Manni-Bucau <rmannibucau@gmail.com> wrote:
>>>> >
>>>> >         Just to let it be clear and let me understand: how is BeamRecord
>>>> >         different from a JsonObject which is an API without implementation
>>>> >         (not event a json one OOTB)? Advantage of json *api* are indeed
>>>> >         natural mapping (jsonb is based on jsonp so no new binding to
>>>> >         reinvent) and simple serialization (json+gzip for ex, or avro if you
>>>> >         want to be geeky).
>>>> >
>>>> >         I fail to see the point to rebuild an ecosystem ATM.
>>>> >
>>>> >         Le 26 avr. 2018 19:12, "Reuven Lax" <relax@google.com> a écrit :
>>>> >
>>>> >             Exactly what JB said. We will write a generic conversion from Avro
>>>> >             (or json) to Beam schemas, which will make them work transparently
>>>> >             with SQL. The plan is also to migrate Anton's work so that POJOs
>>>> >             works generically for any schema.
>>>> >
>>>> >             Reuven
>>>> >
>>>> >             On Thu, Apr 26, 2018 at 1:17 AM Jean-Baptiste Onofré <jb@nanthrax.net> wrote:
>>>> >
>>>> >                 For now we have a generic schema interface. Json-b can be an
>>>> >                 impl, avro could be another one.
>>>> >
>>>> >                 Regards
>>>> >                 JB
>>>> >
>>>> >                 Le 26 avr. 2018, à 12:08, Romain Manni-Bucau <rmannibucau@gmail.com> a écrit:
>>>> >
>>>> >                     Hmm,
>>>> >
>>>> >                     avro has still the pitfalls to have an uncontrolled stack
>>>> >                     which brings way too much dependencies to be part of any
>>>> >                     API, this is why I proposed a JSON-P based API (JsonObject)
>>>> >                     with a custom beam entry for some metadata (headers "à la
>>>> >                     Camel").
>>>> >
>>>> >                     Romain Manni-Bucau
>>>> >                     @rmannibucau <https://twitter.com/rmannibucau> | Blog
>>>> >                     <https://rmannibucau.metawerx.net/> | Old Blog
>>>> >                     <http://rmannibucau.wordpress.com> | Github
>>>> >                     <https://github.com/rmannibucau> | LinkedIn
>>>> >                     <https://www.linkedin.com/in/rmannibucau> | Book
>>>> >                     <https://www.packtpub.com/application-development/java-ee-8-high-performance>
>>>> >
>>>> >                     2018-04-26 9:59 GMT+02:00 Jean-Baptiste Onofré <jb@nanthrax.net>:
>>>> >
>>>> >                         Hi Ismael
>>>> >
>>>> >                         You mean directly in Beam SQL ?
>>>> >
>>>> >                         That will be part of schema support: generic record
>>>> >                         could be one of the payload with across schema.
>>>> >
>>>> >                         Regards
>>>> >                         JB
>>>> >
>>>> >                         Le 26 avr. 2018, à 11:39, "Ismaël Mejía" <iemejia@gmail.com> a écrit:
>>>> >
>>>> >                             Hello Anton,
>>>> >
>>>> >                             Thanks for the descriptive email and the really
>>>> >                             useful work. Any plans to tackle PCollections of
>>>> >                             GenericRecord/IndexedRecords? it seems Avro is a
>>>> >                             natural fit for this approach too.
>>>> >
>>>> >                             Regards,
>>>> >                             Ismaël
>>>> >
>>>> >                             On Wed, Apr 25, 2018 at 9:04 PM, Anton Kedin <kedin@google.com> wrote:
>>>> >
>>>> >                                 Hi,
>>>> >
>>>> >                                 I want to highlight a couple of improvements to
>>>> >                                 Beam SQL we have been working on recently which
>>>> >                                 are targeted to make Beam SQL API easier to
>>>> >                                 use. Specifically these features simplify
>>>> >                                 conversion of Java Beans and JSON strings to
>>>> >                                 Rows.
>>>> >
>>>> >                                 Feel free to try this and send any
>>>> >                                 bugs/comments/PRs my way.
>>>> >
>>>> >                                 **Caveat: this is still work in progress, and
>>>> >                                 has known bugs and incomplete features, see
>>>> >                                 below for details.**
>>>> >
>>>> >                                 Background
>>>> >
>>>> >                                 Beam SQL queries can only be applied to
>>>> >                                 PCollection<Row>. This means that users need to
>>>> >                                 convert whatever PCollection elements they have
>>>> >                                 to Rows before querying them with SQL. This
>>>> >                                 usually requires
>>>>
>>>
>>>

Re: Beam SQL Improvements

Posted by Reuven Lax <re...@google.com>.
Romain, maybe it would be useful for us to find some time on slack. I'd
like to understand your concerns. Also keep in mind that I'm tagging all
these classes as Experimental for now, so we can definitely change these
interfaces around if we decide they are not the best ones.

Reuven

On Tue, May 22, 2018 at 11:35 PM Romain Manni-Bucau <rm...@gmail.com>
wrote:

> Why not extend ProcessContext to add the new remapped output? Otherwise it
> looks good (the part I don't like is that creating a new context each time a
> new feature is added hurts users. What happens when Beam adds some reactive
> support? A ReactiveOutputReceiver?)
>
> Pipeline sounds like the wrong storage since, once distributed, you have
> serialized the instances, so you have kind of broken the lifecycle of the
> original instance and no longer have a real release/close hook on them, right?
> Not sure we can do better than DoFn/source embedded instances today.
>
>
>
>
> Le mer. 23 mai 2018 08:02, Romain Manni-Bucau <rm...@gmail.com> a
> écrit :
>
>>
>>
>> Le mer. 23 mai 2018 07:55, Jean-Baptiste Onofré <jb...@nanthrax.net> a
>> écrit :
>>
>>> Hi,
>>>
>>> IMHO, it would be better to have an explicit transform/IO as a converter.
>>>
>>> It would be easier for users.
>>>
>>> Another option would be to use a "TypeConverter/SchemaConverter" map as
>>> we do in Camel: Beam could check the source/destination "type" and check
>>> in the map if there's a converter available. This map can be stored as
>>> part of the pipeline (as we do for filesystem registration).
>>>
>>
>>
>> It works in Camel because it is not strongly typed, isn't it? So it can
>> require a new Beam pipeline API.
>>
>> +1 for the explicit transform; if added to the pipeline API the way coders
>> are, it wouldn't break the fluent API:
>>
>> p.apply(io).setOutputType(Foo.class)
>>
>> Coders could be a workaround since they own the type, but since the
>> PCollection is the real owner it is surely saner this way, no?
>>
>> It also needs to ensure all converters are present before running the
>> pipeline; no implicit environment converter support is probably good to
>> start with, to avoid late surprises.
>>
>>
>>
>>> My $0.01
>>>
>>> Regards
>>> JB
>>>
>>> On 23/05/2018 07:51, Romain Manni-Bucau wrote:
>>> > How does it work on the pipeline side?
>>> > Do you generate these "virtual" IOs at build time to enable the fluent
>>> > API to work without erasing generics?
>>> >
>>> > ex: SQL(row)->BigQuery(native) will not compile so we need a
>>> > SQL(row)->BigQuery(row)
>>> >
>>> > Side note unrelated to Row: if you add another registry, maybe a pretask
>>> > is to ensure Beam has a kind of singleton/context to avoid duplicating
>>> > it or not tracking it properly. These kinds of converters will need a
>>> > global close and not only a per-record one in general:
>>> > converter.init();converter.convert(row);....converter.destroy();,
>>> > otherwise it easily leaks. This is why it can require some way to not
>>> > recreate it. A quick fix, if you are in ByteBuddy already, can be to add
>>> > it to setup/teardown probably; being more global would be nicer but is
>>> > more challenging.
>>> >
>>> > Romain Manni-Bucau
>>> > @rmannibucau <https://twitter.com/rmannibucau> |  Blog
>>> > <https://rmannibucau.metawerx.net/> | Old Blog
>>> > <http://rmannibucau.wordpress.com> | Github
>>> > <https://github.com/rmannibucau> | LinkedIn
>>> > <https://www.linkedin.com/in/rmannibucau> | Book
>>> > <
>>> https://www.packtpub.com/application-development/java-ee-8-high-performance
>>> >
>>> >
>>> >
>>> > Le mer. 23 mai 2018 à 07:22, Reuven Lax <relax@google.com
>>> > <ma...@google.com>> a écrit :
>>> >
>>> >     No - the only modules we need to add to core are the ones we choose
>>> >     to add. For example, I will probably add a registration for
>>> >     TableRow/TableSchema (GCP BigQuery) so these can work seamlessly
>>> >     with schemas. However I will add that to the GCP module, so only
>>> >     someone depending on that module need to pull in that dependency.
>>> >     The Java ServiceLoader framework can be used by these modules to
>>> >     register schemas for their types (we already do something similar
>>> >     for FileSystem and for coders as well).
>>> >
>>> >     BTW, right now I'm doing the conversion back and forth between Row
>>> >     objects in the ByteBuddy-generated bytecode that we generate in order
>>> >     to invoke DoFns.
>>> >
>>> >     Reuven
>>> >
>>> >     On Tue, May 22, 2018 at 10:04 PM Romain Manni-Bucau
>>> >     <rmannibucau@gmail.com <ma...@gmail.com>> wrote:
>>> >
>>> >         Hmm, the pluggability part is close to what I wanted to do with
>>> >         JsonObject as a main API (to avoid redoing a "row" API and a
>>> >         schema API).
>>> >         Row.as(Class<T>) sounds good, but then does it mean we'll get
>>> >         beam-sdk-java-row-jsonobject-like modules (I'm not against, just
>>> >         trying to understand here)?
>>> >         If so, how can an IO use as() with the type it expects? Doesn't
>>> >         it lead to having tons of these modules in the end?
>>> >
>>> >         Romain Manni-Bucau
>>> >         @rmannibucau <https://twitter.com/rmannibucau> |  Blog
>>> >         <https://rmannibucau.metawerx.net/> | Old Blog
>>> >         <http://rmannibucau.wordpress.com> | Github
>>> >         <https://github.com/rmannibucau> | LinkedIn
>>> >         <https://www.linkedin.com/in/rmannibucau> | Book
>>> >         <
>>> https://www.packtpub.com/application-development/java-ee-8-high-performance
>>> >
>>> >
>>> >
>>> >         Le mer. 23 mai 2018 à 04:57, Reuven Lax <relax@google.com
>>> >         <ma...@google.com>> a écrit :
>>> >
>>> >             By the way Romain, if you have specific scenarios in mind I
>>> >             would love to hear them. I can try and guess what exactly
>>> >             you would like to get out of schemas, but it would work
>>> >             better if you gave me concrete scenarios that you would
>>> like
>>> >             to work.
>>> >
>>> >             Reuven
>>> >
>>> >             On Tue, May 22, 2018 at 7:45 PM Reuven Lax <
>>> relax@google.com
>>> >             <ma...@google.com>> wrote:
>>> >
>>> >                 Yeah, what I'm working on will help with IO. Basically
>>> >                 if you register a function with SchemaRegistry that
>>> >                 converts back and forth between a type (say JsonObject)
>>> >                 and a Beam Row, then it is applied by the framework
>>> >                 behind the scenes as part of DoFn invocation. Concrete
>>> >                 example: let's say I have an IO that reads json objects
>>> >                   class MyJsonIORead extends PTransform<PBegin,
>>> >                 PCollection<JsonObject>> {...}
>>> >
>>> >                 If you register a schema for this type (or you can also
>>> >                 just set the schema directly on the output
>>> PCollection),
>>> >                 then Beam knows how to convert back and forth between
>>> >                 JsonObject and Row. So the next ParDo can look like
>>> >
>>> >                 p.apply(new MyJsonIORead())
>>> >                 .apply(ParDo.of(new DoFn<JsonObject, T>....
>>> >                     @ProcessElement void process(@Element Row row) {
>>> >                    })
>>> >
>>> >                 And Beam will automatically convert JsonObject to a Row
>>> >                 for processing (you aren't forced to do this of course
>>> -
>>> >                 you can always ask for it as a JsonObject).
>>> >
>>> >                 The same is true for output. If you have a sink that
>>> >                 takes in JsonObject but the transform before it
>>> produces
>>> >                 Row objects (for instance - because the transform
>>> before
>>> >                 it is Beam SQL), Beam can automatically convert Row
>>> back
>>> >                 to JsonObject for you.
>>> >
>>> >                 All of this was detailed in the Schema doc I shared a
>>> >                 few months ago. There was a lot of discussion on that
>>> >                 document from various parties, and some of this API is
>>> a
>>> >                 result of that discussion. This is also working in the
>>> >                 branch JB and I were working on, though not yet
>>> >                 integrated back to master.
>>> >
>>> >                 I would like to actually go further and make Row an
>>> >                 interface and provide a way to automatically put a Row
>>> >                 interface on top of any other object (e.g. JsonObject,
>>> >                 Pojo, etc.) This won't change the way the user writes
>>> >                 code, but instead of Beam having to copy and convert at
>>> >                 each stage (e.g. from JsonObject to Row) it simply will
>>> >                 create a Row object that uses the JsonObject as its
>>> >                 underlying storage.
>>> >
>>> >                 Reuven
>>> >
>>> >                 On Tue, May 22, 2018 at 11:37 AM Romain Manni-Bucau
>>> >                 <rmannibucau@gmail.com <ma...@gmail.com>>
>>> >                 wrote:
>>> >
>>> >                     Well, Beam can implement a new mapper but it doesn't
>>> >                     help for IO. Most modern backends will take json
>>> >                     directly, even the javax one, and it must stay generic.
>>> >
>>> >                     Then, since json-to-pojo mapping has already been done a
>>> >                     dozen times, not sure it is worth it for now.
>>> >
>>> >                     Le mar. 22 mai 2018 20:27, Reuven Lax
>>> >                     <relax@google.com <ma...@google.com>> a
>>> écrit :
>>> >
>>> >                         We can do even better btw. Building a
>>> >                         SchemaRegistry where automatic conversions can
>>> >                         be registered between schema and Java data
>>> >                         types. With this the user won't even need a
>>> DoFn
>>> >                         to do the conversion.
>>> >
>>> >                         On Tue, May 22, 2018, 10:13 AM Romain
>>> >                         Manni-Bucau <rmannibucau@gmail.com
>>> >                         <ma...@gmail.com>> wrote:
>>> >
>>> >                             Hi guys,
>>> >
>>> >                             Checked out what has been done on schema
>>> >                             model and think it is acceptable -
>>> regarding
>>> >                             the json debate -
>>> >                             if
>>> https://issues.apache.org/jira/browse/BEAM-4381
>>> >                             can be fixed.
>>> >
>>> >                             High level, it is about providing a
>>> >                             mainstream and not too impacting model OOTB
>>> >                             and JSON seems the most valid option for
>>> >                             now, at least for IO and some user
>>> transforms.
>>> >
>>> >                             Wdyt?
>>> >
>>> >                             Le ven. 27 avr. 2018 18:36, Romain
>>> >                             Manni-Bucau <rmannibucau@gmail.com
>>> >                             <ma...@gmail.com>> a écrit :
>>> >
>>> >                                  Can give it a try end of may, sure.
>>> >                                 (holidays and work constraints will
>>> make
>>> >                                 it hard before).
>>> >
>>> >                                 Le 27 avr. 2018 18:26, "Anton Kedin"
>>> >                                 <kedin@google.com
>>> >                                 <ma...@google.com>> a écrit :
>>> >
>>> >                                     Romain,
>>> >
>>> >                                     I don't believe that JSON approach
>>> >                                     was investigated very thoroughIy. I
>>> >                                     mentioned few reasons which will
>>> >                                     make it not the best choice my
>>> >                                     opinion, but I may be wrong. Can
>>> you
>>> >                                     put together a design doc or a
>>> >                                     prototype?
>>> >
>>> >                                     Thank you,
>>> >                                     Anton
>>> >
>>> >
>>> >                                     On Thu, Apr 26, 2018 at 10:17 PM
>>> >                                     Romain Manni-Bucau
>>> >                                     <rmannibucau@gmail.com
>>> >                                     <ma...@gmail.com>>
>>> wrote:
>>> >
>>> >
>>> >
>>> >                                         Le 26 avr. 2018 23:13, "Anton
>>> >                                         Kedin" <kedin@google.com
>>> >                                         <ma...@google.com>> a
>>> écrit :
>>> >
>>> >                                             BeamRecord (Row) has very
>>> >                                             little in common with
>>> >                                             JsonObject (I assume you're
>>> >                                             talking about javax.json),
>>> >                                             except maybe some
>>> >                                             similarities of the API.
>>> Few
>>> >                                             reasons why JsonObject
>>> >                                             doesn't work:
>>> >
>>> >                                               * it is a Java EE API:
>>> >                                                   o Beam SDK is not
>>> >                                                     limited to Java.
>>> >                                                     There are probably
>>> >                                                     similar APIs for
>>> >                                                     other languages but
>>> >                                                     they might not
>>> >                                                     necessarily carry
>>> >                                                     the same semantics
>>> /
>>> >                                                     APIs;
>>> >
>>> >
>>> >                                         Not a big deal I think. At
>>> least
>>> >                                         not a technical blocker.
>>> >
>>> >                                                   o It can change
>>> >                                                     between Java
>>> versions;
>>> >
>>> >                                         No, this is javaee ;).
>>> >
>>> >
>>> >                                                   o Current Beam java
>>> >                                                     implementation is
>>> an
>>> >                                                     experimental
>>> feature
>>> >                                                     to identify what's
>>> >                                                     needed from such
>>> >                                                     API, in the end we
>>> >                                                     might end up with
>>> >                                                     something similar
>>> to
>>> >                                                     JsonObject API, but
>>> >                                                     likely not
>>> >
>>> >
>>> >                                         I dont get that point as a
>>> blocker
>>> >
>>> >                                                   o ;
>>> >                                               * represents JSON, which
>>> >                                                 is not an API but an
>>> >                                                 object notation:
>>> >                                                   o it is defined as
>>> >                                                     unicode string in a
>>> >                                                     certain format. If
>>> >                                                     you choose to
>>> adhere
>>> >                                                     to ECMA-404, then
>>> it
>>> >                                                     doesn't sound like
>>> >                                                     JsonObject can
>>> >                                                     represent an Avro
>>> >                                                     object, if I'm
>>> >                                                     reading it right;
>>> >
>>> >
>>> >                                         It is in the generator impl,
>>> you
>>> >                                         can impl an avrogenerator.
>>> >
>>> >                                               * doesn't define a type
>>> >                                                 system (JSON does, but
>>> >                                                 it's lacking):
>>> >                                                   o for example, JSON
>>> >                                                     doesn't define
>>> >                                                     semantics for
>>> numbers;
>>> >                                                   o doesn't define
>>> >                                                     date/time types;
>>> >                                                   o doesn't allow
>>> >                                                     extending JSON type
>>> >                                                     system at all;
>>> >
>>> >
>>> >                                         That is why you need a metada
>>> >                                         object, or simpler, a schema
>>> >                                         with that data. Json or beam
>>> >                                         record doesnt help here and you
>>> >                                         end up on the same outcome if
>>> >                                         you think about it.
>>> >
>>> >                                               * lacks schemas;
>>> >
>>> >                                         Jsonschema are standard, widely
>>> >                                         spread and tooled compared to
>>> >                                         alternative.
>>> >
>>> >                                             You can definitely try
>>> >                                             loosen the requirements and
>>> >                                             define everything in JSON
>>> in
>>> >                                             userland, but the point of
>>> >                                             Row/Schema is to avoid it
>>> >                                             and define everything in
>>> >                                             Beam model, which can be
>>> >                                             extended, mapped to JSON,
>>> >                                             Avro, BigQuery Schemas,
>>> >                                             custom binary format etc.,
>>> >                                             with same semantics across
>>> >                                             beam SDKs.
>>> >
>>> >
>>> >                                         This is what jsonp would allow
>>> >                                         with the benefit of a natural
>>> >                                         pojo support through jsonb.
>>> >
>>> >
>>> >
>>> >                                             On Thu, Apr 26, 2018 at 12:28 PM Romain Manni-Bucau
>>> >                                             <rmannibucau@gmail.com <mailto:rmannibucau@gmail.com>> wrote:
>>> >
>>> >                                                 Just to let it be clear and let me understand: how is
>>> >                                                 BeamRecord different from a JsonObject which is an API
>>> >                                                 without implementation (not event a json one OOTB)?
>>> >                                                 Advantage of json *api* are indeed natural mapping (jsonb
>>> >                                                 is based on jsonp so no new binding to reinvent) and
>>> >                                                 simple serialization (json+gzip for ex, or avro if you
>>> >                                                 want to be geeky).
>>> >
>>> >                                                 I fail to see the point to rebuild an ecosystem ATM.
>>> >
>>> >                                                 Le 26 avr. 2018 19:12, "Reuven Lax" <relax@google.com
>>> >                                                 <mailto:relax@google.com>> a écrit :
>>> >
>>> >                                                     Exactly what JB said. We will write a generic
>>> >                                                     conversion from Avro (or json) to Beam schemas, which
>>> >                                                     will make them work transparently with SQL. The plan
>>> >                                                     is also to migrate Anton's work so that POJOs works
>>> >                                                     generically for any schema.
>>> >
>>> >                                                     Reuven
>>> >
>>> >                                                     On Thu, Apr 26, 2018 at 1:17 AM Jean-Baptiste Onofré
>>> >                                                     <jb@nanthrax.net <mailto:jb@nanthrax.net>> wrote:
>>> >
>>> >                                                         For now we have a generic schema interface.
>>> >                                                         Json-b can be an impl, avro could be another one.
>>> >
>>> >                                                         Regards
>>> >                                                         JB
>>> >                                                         Le 26 avr. 2018, à 12:08, Romain Manni-Bucau
>>> >                                                         <rmannibucau@gmail.com <mailto:rmannibucau@gmail.com>> a écrit:
>>> >
>>> >                                                             Hmm,
>>> >
>>> >                                                             avro has still the pitfalls to have an
>>> >                                                             uncontrolled stack which brings way too much
>>> >                                                             dependencies to be part of any API, this is why
>>> >                                                             I proposed a JSON-P based API (JsonObject) with
>>> >                                                             a custom beam entry for some metadata (headers
>>> >                                                             "à la Camel").
>>> >
>>> >                                                             Romain Manni-Bucau
>>> >                                                             @rmannibucau <https://twitter.com/rmannibucau> |
>>> >                                                             Blog <https://rmannibucau.metawerx.net/> |
>>> >                                                             Old Blog <http://rmannibucau.wordpress.com> |
>>> >                                                             Github <https://github.com/rmannibucau> |
>>> >                                                             LinkedIn <https://www.linkedin.com/in/rmannibucau> |
>>> >                                                             Book <https://www.packtpub.com/application-development/java-ee-8-high-performance>
>>> >
>>> >                                                             2018-04-26 9:59 GMT+02:00 Jean-Baptiste Onofré
>>> >                                                             <jb@nanthrax.net <mailto:jb@nanthrax.net>>:
>>> >
>>> >                                                                 Hi Ismael
>>> >
>>> >                                                                 You mean directly in Beam SQL ?
>>> >
>>> >                                                                 That will be part of schema support:
>>> >                                                                 generic record could be one of the payload
>>> >                                                                 with across schema.
>>> >
>>> >                                                                 Regards
>>> >                                                                 JB
>>> >                                                                 Le 26 avr. 2018, à 11:39, "Ismaël Mejía"
>>> >                                                                 <iemejia@gmail.com <ma...@gmail.com>> a écrit:
>>> >
>>> >                                                                     Hello Anton,
>>> >
>>> >                                                                     Thanks for the descriptive email and the
>>> >                                                                     really useful work. Any plans to tackle
>>> >                                                                     PCollections of GenericRecord/IndexedRecords?
>>> >                                                                     it seems Avro is a natural fit for this
>>> >                                                                     approach too.
>>> >
>>> >                                                                     Regards,
>>> >                                                                     Ismaël
>>> >
>>> >                                                                     On Wed, Apr 25, 2018 at 9:04 PM, Anton Kedin
>>> >                                                                     <kedin@google.com <ma...@google.com>> wrote:
>>> >
>>> >                                                                         Hi,
>>> >
>>> >                                                                         I want to highlight a couple of
>>> >                                                                         improvements to Beam SQL we have been
>>> >                                                                         working on recently which are targeted
>>> >                                                                         to make Beam SQL API easier to use.
>>> >                                                                         Specifically these features simplify
>>> >                                                                         conversion of Java Beans and JSON
>>> >                                                                         strings to Rows.
>>> >
>>> >                                                                         Feel free to try this and send any
>>> >                                                                         bugs/comments/PRs my way.
>>> >
>>> >                                                                         **Caveat: this is still work in
>>> >                                                                         progress, and has known bugs and
>>> >                                                                         incomplete features, see below for
>>> >                                                                         details.**
>>> >
>>> >                                                                         Background
>>> >
>>> >                                                                         Beam SQL queries can only be applied to
>>> >                                                                         PCollection<Row>. This means that users
>>> >                                                                         need to convert whatever PCollection
>>> >                                                                         elements they have to Rows before
>>> >                                                                         querying them with SQL. This usually
>>> >                                                                         requires
>>
>>

Re: Beam SQL Improvements

Posted by Romain Manni-Bucau <rm...@gmail.com>.
Why not extend ProcessContext to add the new remapped output? But it looks
good (the part I don't like is that creating a new context each time a new
feature is added hurts users. What happens when Beam adds some reactive
support? A ReactiveOutputReceiver?)

The Pipeline sounds like the wrong storage: once distributed, the instances are
serialized, which kind of breaks the lifecycle of the original instance, and
there is no real release/close hook on them anymore, right? Not sure we can do
better than DoFn/source embedded instances today.
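
To make that lifecycle point concrete, here is a minimal sketch of the "dofn
embedded instances" option: the converter is owned by the DoFn, created in
@Setup and released in @Teardown. JsonRowConverter and its init/convert/destroy
methods are hypothetical placeholders, not an existing Beam or javax.json API;
only the DoFn lifecycle annotations are real.

import javax.json.JsonObject;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.Row;

// Hypothetical JSON -> Row converter with an explicit lifecycle (init/destroy).
class JsonRowConverter {
  void init() { /* allocate whatever the mapping needs */ }
  Row convert(JsonObject json) { return null; /* placeholder: real mapping goes here */ }
  void destroy() { /* release resources */ }
}

// The converter lives in the DoFn instance: created in @Setup, used per record,
// released in @Teardown - roughly the only close hook left once the pipeline
// has been serialized and shipped to workers.
class JsonToRowFn extends DoFn<JsonObject, Row> {
  private transient JsonRowConverter converter;

  @Setup
  public void setup() {
    converter = new JsonRowConverter();
    converter.init();
  }

  @ProcessElement
  public void processElement(ProcessContext c) {
    c.output(converter.convert(c.element()));
  }

  @Teardown
  public void teardown() {
    converter.destroy();
  }
}

The same constraint applies to sources: anything that needs a global (not
per-record) close today has to be owned by the serialized DoFn/source instance
itself.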




Le mer. 23 mai 2018 08:02, Romain Manni-Bucau <rm...@gmail.com> a
écrit :

>
>
> Le mer. 23 mai 2018 07:55, Jean-Baptiste Onofré <jb...@nanthrax.net> a
> écrit :
>
>> Hi,
>>
>> IMHO, it would be better to have a explicit transform/IO as converter.
>>
>> It would be easier for users.
>>
>> Another option would be to use a "TypeConverter/SchemaConverter" map as
>> we do in Camel: Beam could check the source/destination "type" and check
>> in the map if there's a converter available. This map can be store as
>> part of the pipeline (as we do for filesystem registration).
>>
>
>
> It works in camel because it is not strongly typed, isnt it? So can
> require a beam new pipeline api.
>
> +1 for the explicit transform, if added to the pipeline api as coder it
> wouldnt break the fluent api:
>
> p.apply(io).setOutputType(Foo.class)
>
> Coders can be a workaround since they owns the type but since the
> pcollection is the real owner it is surely saner this way, no?
>
> Also it needs to ensure all converters are present before running the
> pipeline probably, no implicit environment converter support is probably
> good to start to avoid late surprises.
>
>
>
>> My $0.01
>>
>> Regards
>> JB
>>
>> On 23/05/2018 07:51, Romain Manni-Bucau wrote:
>> > How does it work on the pipeline side?
>> > Do you generate these "virtual" IO at build time to enable the fluent
>> > API to work not erasing generics?
>> >
>> > ex: SQL(row)->BigQuery(native) will not compile so we need a
>> > SQL(row)->BigQuery(row)
>> >
>> > Side note unrelated to Row: if you add another registry maybe a pretask
>> > is to ensure beam has a kind of singleton/context to avoid to duplicate
>> > it or not track it properly. These kind of converters will need a global
>> > close and not only per record in general:
>> > converter.init();converter.convert(row);....converter.destroy();,
>> > otherwise it easily leaks. This is why it can require some way to not
>> > recreate it. A quick fix, if you are in bytebuddy already, can be to add
>> > it to setup/teardown pby, being more global would be nicer but is more
>> > challenging.
>> >
>> > Romain Manni-Bucau
>> > @rmannibucau <https://twitter.com/rmannibucau> |  Blog
>> > <https://rmannibucau.metawerx.net/> | Old Blog
>> > <http://rmannibucau.wordpress.com> | Github
>> > <https://github.com/rmannibucau> | LinkedIn
>> > <https://www.linkedin.com/in/rmannibucau> | Book
>> > <
>> https://www.packtpub.com/application-development/java-ee-8-high-performance
>> >
>> >
>> >
>> > Le mer. 23 mai 2018 à 07:22, Reuven Lax <relax@google.com
>> > <ma...@google.com>> a écrit :
>> >
>> >     No - the only modules we need to add to core are the ones we choose
>> >     to add. For example, I will probably add a registration for
>> >     TableRow/TableSchema (GCP BigQuery) so these can work seamlessly
>> >     with schemas. However I will add that to the GCP module, so only
>> >     someone depending on that module need to pull in that dependency.
>> >     The Java ServiceLoader framework can be used by these modules to
>> >     register schemas for their types (we already do something similar
>> >     for FileSystem and for coders as well).
>> >
>> >     BTW, right now the conversion back and forth between Row objects I'm
>> >     doing in the ByteBuddy generated bytecode that we generate in order
>> >     to invoke DoFns.
>> >
>> >     Reuven
>> >
>> >     On Tue, May 22, 2018 at 10:04 PM Romain Manni-Bucau
>> >     <rmannibucau@gmail.com <ma...@gmail.com>> wrote:
>> >
>> >         Hmm, the pluggability part is close to what I wanted to do with
>> >         JsonObject as a main API (to avoid to redo a "row" API and
>> >         schema API)
>> >         Row.as(Class<T>) sounds good but then, does it mean we'll get
>> >         beam-sdk-java-row-jsonobject like modules (I'm not against, just
>> >         trying to understand here)?
>> >         If so, how an IO can use as() with the type it expects? Doesnt
>> >         it lead to have a tons of  these modules at the end?
>> >
>> >         Romain Manni-Bucau
>> >         @rmannibucau <https://twitter.com/rmannibucau> |  Blog
>> >         <https://rmannibucau.metawerx.net/> | Old Blog
>> >         <http://rmannibucau.wordpress.com> | Github
>> >         <https://github.com/rmannibucau> | LinkedIn
>> >         <https://www.linkedin.com/in/rmannibucau> | Book
>> >         <
>> https://www.packtpub.com/application-development/java-ee-8-high-performance
>> >
>> >
>> >
>> >         Le mer. 23 mai 2018 à 04:57, Reuven Lax <relax@google.com
>> >         <ma...@google.com>> a écrit :
>> >
>> >             By the way Romain, if you have specific scenarios in mind I
>> >             would love to hear them. I can try and guess what exactly
>> >             you would like to get out of schemas, but it would work
>> >             better if you gave me concrete scenarios that you would like
>> >             to work.
>> >
>> >             Reuven
>> >
>> >             On Tue, May 22, 2018 at 7:45 PM Reuven Lax <
>> relax@google.com
>> >             <ma...@google.com>> wrote:
>> >
>> >                 Yeah, what I'm working on will help with IO. Basically
>> >                 if you register a function with SchemaRegistry that
>> >                 converts back and forth between a type (say JsonObject)
>> >                 and a Beam Row, then it is applied by the framework
>> >                 behind the scenes as part of DoFn invocation. Concrete
>> >                 example: let's say I have an IO that reads json objects
>> >                   class MyJsonIORead extends PTransform<PBegin,
>> >                 JsonObject> {...}
>> >
>> >                 If you register a schema for this type (or you can also
>> >                 just set the schema directly on the output PCollection),
>> >                 then Beam knows how to convert back and forth between
>> >                 JsonObject and Row. So the next ParDo can look like
>> >
>> >                 p.apply(new MyJsonIORead())
>> >                 .apply(ParDo.of(new DoFn<JsonObject, T>....
>> >                     @ProcessElement void process(@Element Row row) {
>> >                    })
>> >
>> >                 And Beam will automatically convert JsonObject to a Row
>> >                 for processing (you aren't forced to do this of course -
>> >                 you can always ask for it as a JsonObject).
>> >
>> >                 The same is true for output. If you have a sink that
>> >                 takes in JsonObject but the transform before it produces
>> >                 Row objects (for instance - because the transform before
>> >                 it is Beam SQL), Beam can automatically convert Row back
>> >                 to JsonObject for you.
>> >
>> >                 All of this was detailed in the Schema doc I shared a
>> >                 few months ago. There was a lot of discussion on that
>> >                 document from various parties, and some of this API is a
>> >                 result of that discussion. This is also working in the
>> >                 branch JB and I were working on, though not yet
>> >                 integrated back to master.
>> >
>> >                 I would like to actually go further and make Row an
>> >                 interface and provide a way to automatically put a Row
>> >                 interface on top of any other object (e.g. JsonObject,
>> >                 Pojo, etc.) This won't change the way the user writes
>> >                 code, but instead of Beam having to copy and convert at
>> >                 each stage (e.g. from JsonObject to Row) it simply will
>> >                 create a Row object that uses the the JsonObject as its
>> >                 underlying storage.
>> >
>> >                 Reuven
>> >
>> >                 On Tue, May 22, 2018 at 11:37 AM Romain Manni-Bucau
>> >                 <rmannibucau@gmail.com <ma...@gmail.com>>
>> >                 wrote:
>> >
>> >                     Well, beam can implement a new mapper but it doesnt
>> >                     help for io. Most of modern backends will take json
>> >                     directly, even javax one and it must stay generic.
>> >
>> >                     Then since json to pojo mapping is already done a
>> >                     dozen of times, not sure it is worth it for now.
>> >
>> >                     Le mar. 22 mai 2018 20:27, Reuven Lax
>> >                     <relax@google.com <ma...@google.com>> a
>> écrit :
>> >
>> >                         We can do even better btw. Building a
>> >                         SchemaRegistry where automatic conversions can
>> >                         be registered between schema and Java data
>> >                         types. With this the user won't even need a DoFn
>> >                         to do the conversion.
>> >
>> >                         On Tue, May 22, 2018, 10:13 AM Romain
>> >                         Manni-Bucau <rmannibucau@gmail.com
>> >                         <ma...@gmail.com>> wrote:
>> >
>> >                             Hi guys,
>> >
>> >                             Checked out what has been done on schema
>> >                             model and think it is acceptable - regarding
>> >                             the json debate -
>> >                             if
>> https://issues.apache.org/jira/browse/BEAM-4381
>> >                             can be fixed.
>> >
>> >                             High level, it is about providing a
>> >                             mainstream and not too impacting model OOTB
>> >                             and JSON seems the most valid option for
>> >                             now, at least for IO and some user
>> transforms.
>> >
>> >                             Wdyt?
>> >
>> >                             Le ven. 27 avr. 2018 18:36, Romain
>> >                             Manni-Bucau <rmannibucau@gmail.com
>> >                             <ma...@gmail.com>> a écrit :
>> >
>> >                                  Can give it a try end of may, sure.
>> >                                 (holidays and work constraints will make
>> >                                 it hard before).
>> >
>> >                                 Le 27 avr. 2018 18:26, "Anton Kedin"
>> >                                 <kedin@google.com
>> >                                 <ma...@google.com>> a écrit :
>> >
>> >                                     Romain,
>> >
>> >                                     I don't believe that JSON approach
>> >                                     was investigated very thoroughIy. I
>> >                                     mentioned few reasons which will
>> >                                     make it not the best choice my
>> >                                     opinion, but I may be wrong. Can you
>> >                                     put together a design doc or a
>> >                                     prototype?
>> >
>> >                                     Thank you,
>> >                                     Anton
>> >
>> >
>> >                                     On Thu, Apr 26, 2018 at 10:17 PM
>> >                                     Romain Manni-Bucau
>> >                                     <rmannibucau@gmail.com
>> >                                     <ma...@gmail.com>>
>> wrote:
>> >
>> >
>> >
>> >                                         Le 26 avr. 2018 23:13, "Anton
>> >                                         Kedin" <kedin@google.com
>> >                                         <ma...@google.com>> a
>> écrit :
>> >
>> >                                             BeamRecord (Row) has very
>> >                                             little in common with
>> >                                             JsonObject (I assume you're
>> >                                             talking about javax.json),
>> >                                             except maybe some
>> >                                             similarities of the API. Few
>> >                                             reasons why JsonObject
>> >                                             doesn't work:
>> >
>> >                                               * it is a Java EE API:
>> >                                                   o Beam SDK is not
>> >                                                     limited to Java.
>> >                                                     There are probably
>> >                                                     similar APIs for
>> >                                                     other languages but
>> >                                                     they might not
>> >                                                     necessarily carry
>> >                                                     the same semantics /
>> >                                                     APIs;
>> >
>> >
>> >                                         Not a big deal I think. At least
>> >                                         not a technical blocker.
>> >
>> >                                                   o It can change
>> >                                                     between Java
>> versions;
>> >
>> >                                         No, this is javaee ;).
>> >
>> >
>> >                                                   o Current Beam java
>> >                                                     implementation is an
>> >                                                     experimental feature
>> >                                                     to identify what's
>> >                                                     needed from such
>> >                                                     API, in the end we
>> >                                                     might end up with
>> >                                                     something similar to
>> >                                                     JsonObject API, but
>> >                                                     likely not
>> >
>> >
>> >                                         I dont get that point as a
>> blocker
>> >
>> >                                                   o ;
>> >                                               * represents JSON, which
>> >                                                 is not an API but an
>> >                                                 object notation:
>> >                                                   o it is defined as
>> >                                                     unicode string in a
>> >                                                     certain format. If
>> >                                                     you choose to adhere
>> >                                                     to ECMA-404, then it
>> >                                                     doesn't sound like
>> >                                                     JsonObject can
>> >                                                     represent an Avro
>> >                                                     object, if I'm
>> >                                                     reading it right;
>> >
>> >
>> >                                         It is in the generator impl, you
>> >                                         can impl an avrogenerator.
>> >
>> >                                               * doesn't define a type
>> >                                                 system (JSON does, but
>> >                                                 it's lacking):
>> >                                                   o for example, JSON
>> >                                                     doesn't define
>> >                                                     semantics for
>> numbers;
>> >                                                   o doesn't define
>> >                                                     date/time types;
>> >                                                   o doesn't allow
>> >                                                     extending JSON type
>> >                                                     system at all;
>> >
>> >
>> >                                         That is why you need a metada
>> >                                         object, or simpler, a schema
>> >                                         with that data. Json or beam
>> >                                         record doesnt help here and you
>> >                                         end up on the same outcome if
>> >                                         you think about it.
>> >
>> >                                               * lacks schemas;
>> >
>> >                                         Jsonschema are standard, widely
>> >                                         spread and tooled compared to
>> >                                         alternative.
>> >
>> >                                             You can definitely try
>> >                                             loosen the requirements and
>> >                                             define everything in JSON in
>> >                                             userland, but the point of
>> >                                             Row/Schema is to avoid it
>> >                                             and define everything in
>> >                                             Beam model, which can be
>> >                                             extended, mapped to JSON,
>> >                                             Avro, BigQuery Schemas,
>> >                                             custom binary format etc.,
>> >                                             with same semantics across
>> >                                             beam SDKs.
>> >
>> >
>> >                                         This is what jsonp would allow
>> >                                         with the benefit of a natural
>> >                                         pojo support through jsonb.
>> >
>> >
>> >
>> >                                             On Thu, Apr 26, 2018 at
>> >                                             12:28 PM Romain Manni-Bucau
>> >                                             <rmannibucau@gmail.com
>> >                                             <mailto:
>> rmannibucau@gmail.com>>
>> >                                             wrote:
>> >
>> >                                                 Just to let it be clear
>> >                                                 and let me understand:
>> >                                                 how is BeamRecord
>> >                                                 different from a
>> >                                                 JsonObject which is an
>> >                                                 API without
>> >                                                 implementation (not
>> >                                                 event a json one OOTB)?
>> >                                                 Advantage of json *api*
>> >                                                 are indeed natural
>> >                                                 mapping (jsonb is based
>> >                                                 on jsonp so no new
>> >                                                 binding to reinvent) and
>> >                                                 simple serialization
>> >                                                 (json+gzip for ex, or
>> >                                                 avro if you want to be
>> >                                                 geeky).
>> >
>> >                                                 I fail to see the point
>> >                                                 to rebuild an ecosystem
>> ATM.
>> >
>> >                                                 Le 26 avr. 2018 19:12,
>> >                                                 "Reuven Lax"
>> >                                                 <relax@google.com
>> >                                                 <mailto:
>> relax@google.com>>
>> >                                                 a écrit :
>> >
>> >                                                     Exactly what JB
>> >                                                     said. We will write
>> >                                                     a generic conversion
>> >                                                     from Avro (or json)
>> >                                                     to Beam schemas,
>> >                                                     which will make them
>> >                                                     work transparently
>> >                                                     with SQL. The plan
>> >                                                     is also to migrate
>> >                                                     Anton's work so that
>> >                                                     POJOs works
>> >                                                     generically for any
>> >                                                     schema.
>> >
>> >                                                     Reuven
>> >
>> >                                                     On Thu, Apr 26, 2018
>> >                                                     at 1:17 AM
>> >                                                     Jean-Baptiste Onofré
>> >                                                     <jb@nanthrax.net
>> >                                                     <mailto:
>> jb@nanthrax.net>>
>> >                                                     wrote:
>> >
>> >                                                         For now we have
>> >                                                         a generic schema
>> >                                                         interface.
>> >                                                         Json-b can be an
>> >                                                         impl, avro could
>> >                                                         be another one.
>> >
>> >                                                         Regards
>> >                                                         JB
>> >                                                         Le 26 avr. 2018, à 12:08, Romain Manni-Bucau
>> >                                                         <rmannibucau@gmail.com <mailto:rmannibucau@gmail.com>> a écrit:
>> >
>> >                                                             Hmm,
>> >
>> >                                                             avro has still the pitfalls to have an
>> >                                                             uncontrolled stack which brings way too much
>> >                                                             dependencies to be part of any API, this is why
>> >                                                             I proposed a JSON-P based API (JsonObject) with
>> >                                                             a custom beam entry for some metadata (headers
>> >                                                             "à la Camel").
>> >
>> >                                                             Romain Manni-Bucau
>> >                                                             @rmannibucau <https://twitter.com/rmannibucau> |
>> >                                                             Blog <https://rmannibucau.metawerx.net/> |
>> >                                                             Old Blog <http://rmannibucau.wordpress.com> |
>> >                                                             Github <https://github.com/rmannibucau> |
>> >                                                             LinkedIn <https://www.linkedin.com/in/rmannibucau> |
>> >                                                             Book <https://www.packtpub.com/application-development/java-ee-8-high-performance>
>> >
>> >                                                             2018-04-26 9:59 GMT+02:00 Jean-Baptiste Onofré
>> >                                                             <jb@nanthrax.net <mailto:jb@nanthrax.net>>:
>> >
>> >                                                                 Hi Ismael
>> >
>> >                                                                 You mean directly in Beam SQL ?
>> >
>> >                                                                 That will be part of schema support:
>> >                                                                 generic record could be one of the payload
>> >                                                                 with across schema.
>> >
>> >                                                                 Regards
>> >                                                                 JB
>> >                                                                 Le 26 avr. 2018, à 11:39, "Ismaël Mejía"
>> >                                                                 <iemejia@gmail.com <mailto:iemejia@gmail.com>> a écrit:
>> >
>> >                                                                     Hello Anton,
>> >
>> >                                                                     Thanks for the descriptive email and the
>> >                                                                     really useful work. Any plans to tackle
>> >                                                                     PCollections of GenericRecord/IndexedRecords?
>> >                                                                     it seems Avro is a natural fit for this
>> >                                                                     approach too.
>> >
>> >                                                                     Regards,
>> >                                                                     Ismaël
>> >
>> >                                                                     On Wed, Apr 25, 2018 at 9:04 PM, Anton Kedin
>> >                                                                     <kedin@google.com <ma...@google.com>> wrote:
>> >
>> >                                                                         Hi,
>> >
>> >                                                                         I want to highlight a couple of
>> >                                                                         improvements to Beam SQL we have been
>> >                                                                         working on recently which are targeted
>> >                                                                         to make Beam SQL API easier to use.
>> >                                                                         Specifically these features simplify
>> >                                                                         conversion of Java Beans and JSON
>> >                                                                         strings to Rows.
>> >
>> >                                                                         Feel free to try this and send any
>> >                                                                         bugs/comments/PRs my way.
>> >
>> >                                                                         **Caveat: this is still work in
>> >                                                                         progress, and has known bugs and
>> >                                                                         incomplete features, see below for
>> >                                                                         details.**
>> >
>> >                                                                         Background
>> >
>> >                                                                         Beam SQL queries can only be applied to
>> >                                                                         PCollection<Row>. This means that users
>> >                                                                         need to convert whatever PCollection
>> >                                                                         elements they have to Rows before
>> >                                                                         querying them with SQL. This usually
>> >                                                                         requires
>> >
>
>

Re: Beam SQL Improvements

Posted by Reuven Lax <re...@google.com>.
Yeah, all schemas are verified when the pipeline is constructed (before
anything starts running). BTW - under the covers schemas are implemented as
a special type of coder, and coders are always set on a PCollection.

I'm happy to add explicit conversion transforms as well for Beam users,
though as I mentioned generic transforms and frameworks like SQL will
probably not find it convenient to use them.
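
For the explicit option, a rough sketch of what such a conversion step could
look like on the user side. ConvertVia, its function and coder arguments are
purely illustrative assumptions, not a committed API; PTransform, ParDo,
SerializableFunction and setCoder are the only existing Beam pieces used here.

import org.apache.beam.sdk.coders.Coder;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.PTransform;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.SerializableFunction;
import org.apache.beam.sdk.values.PCollection;

// Hypothetical explicit conversion transform: the conversion shows up as a
// named step in the pipeline instead of happening implicitly when the
// framework invokes the next DoFn.
class ConvertVia<InT, OutT> extends PTransform<PCollection<InT>, PCollection<OutT>> {
  private final SerializableFunction<InT, OutT> fn;
  private final Coder<OutT> outputCoder;

  ConvertVia(SerializableFunction<InT, OutT> fn, Coder<OutT> outputCoder) {
    this.fn = fn;
    this.outputCoder = outputCoder;
  }

  @Override
  public PCollection<OutT> expand(PCollection<InT> input) {
    return input
        .apply("Convert", ParDo.of(new DoFn<InT, OutT>() {
          @ProcessElement
          public void processElement(ProcessContext c) {
            c.output(fn.apply(c.element()));
          }
        }))
        .setCoder(outputCoder);
  }
}

// Usage would be something like (jsonToRow and rowCoder are placeholders):
//   p.apply(new MyJsonIORead())
//    .apply(new ConvertVia<>(jsonToRow, rowCoder))
//    .apply(...)  // downstream transform that expects Rows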


On Tue, May 22, 2018 at 11:02 PM Romain Manni-Bucau <rm...@gmail.com>
wrote:

>
>
> Le mer. 23 mai 2018 07:55, Jean-Baptiste Onofré <jb...@nanthrax.net> a
> écrit :
>
>> Hi,
>>
>> IMHO, it would be better to have a explicit transform/IO as converter.
>>
>> It would be easier for users.
>>
>> Another option would be to use a "TypeConverter/SchemaConverter" map as
>> we do in Camel: Beam could check the source/destination "type" and check
>> in the map if there's a converter available. This map can be store as
>> part of the pipeline (as we do for filesystem registration).
>>
>
>
> It works in camel because it is not strongly typed, isnt it? So can
> require a beam new pipeline api.
>
> +1 for the explicit transform, if added to the pipeline api as coder it
> wouldnt break the fluent api:
>
> p.apply(io).setOutputType(Foo.class)
>
> Coders can be a workaround since they owns the type but since the
> pcollection is the real owner it is surely saner this way, no?
>
> Also it needs to ensure all converters are present before running the
> pipeline probably, no implicit environment converter support is probably
> good to start to avoid late surprises.
>
>
>
>> My $0.01
>>
>> Regards
>> JB
>>
>> On 23/05/2018 07:51, Romain Manni-Bucau wrote:
>> > How does it work on the pipeline side?
>> > Do you generate these "virtual" IO at build time to enable the fluent
>> > API to work not erasing generics?
>> >
>> > ex: SQL(row)->BigQuery(native) will not compile so we need a
>> > SQL(row)->BigQuery(row)
>> >
>> > Side note unrelated to Row: if you add another registry maybe a pretask
>> > is to ensure beam has a kind of singleton/context to avoid to duplicate
>> > it or not track it properly. These kind of converters will need a global
>> > close and not only per record in general:
>> > converter.init();converter.convert(row);....converter.destroy();,
>> > otherwise it easily leaks. This is why it can require some way to not
>> > recreate it. A quick fix, if you are in bytebuddy already, can be to add
>> > it to setup/teardown pby, being more global would be nicer but is more
>> > challenging.
>> >
>> > Romain Manni-Bucau
>> > @rmannibucau <https://twitter.com/rmannibucau> |  Blog
>> > <https://rmannibucau.metawerx.net/> | Old Blog
>> > <http://rmannibucau.wordpress.com> | Github
>> > <https://github.com/rmannibucau> | LinkedIn
>> > <https://www.linkedin.com/in/rmannibucau> | Book
>> > <
>> https://www.packtpub.com/application-development/java-ee-8-high-performance
>> >
>> >
>> >
>> > Le mer. 23 mai 2018 à 07:22, Reuven Lax <relax@google.com
>> > <ma...@google.com>> a écrit :
>> >
>> >     No - the only modules we need to add to core are the ones we choose
>> >     to add. For example, I will probably add a registration for
>> >     TableRow/TableSchema (GCP BigQuery) so these can work seamlessly
>> >     with schemas. However I will add that to the GCP module, so only
>> >     someone depending on that module need to pull in that dependency.
>> >     The Java ServiceLoader framework can be used by these modules to
>> >     register schemas for their types (we already do something similar
>> >     for FileSystem and for coders as well).
>> >
>> >     BTW, right now the conversion back and forth between Row objects I'm
>> >     doing in the ByteBuddy generated bytecode that we generate in order
>> >     to invoke DoFns.
>> >
>> >     Reuven
>> >
>> >     On Tue, May 22, 2018 at 10:04 PM Romain Manni-Bucau
>> >     <rmannibucau@gmail.com <ma...@gmail.com>> wrote:
>> >
>> >         Hmm, the pluggability part is close to what I wanted to do with
>> >         JsonObject as a main API (to avoid to redo a "row" API and
>> >         schema API)
>> >         Row.as(Class<T>) sounds good but then, does it mean we'll get
>> >         beam-sdk-java-row-jsonobject like modules (I'm not against, just
>> >         trying to understand here)?
>> >         If so, how an IO can use as() with the type it expects? Doesnt
>> >         it lead to have a tons of  these modules at the end?
>> >
>> >         Romain Manni-Bucau
>> >         @rmannibucau <https://twitter.com/rmannibucau> |  Blog
>> >         <https://rmannibucau.metawerx.net/> | Old Blog
>> >         <http://rmannibucau.wordpress.com> | Github
>> >         <https://github.com/rmannibucau> | LinkedIn
>> >         <https://www.linkedin.com/in/rmannibucau> | Book
>> >         <
>> https://www.packtpub.com/application-development/java-ee-8-high-performance
>> >
>> >
>> >
>> >         Le mer. 23 mai 2018 à 04:57, Reuven Lax <relax@google.com
>> >         <ma...@google.com>> a écrit :
>> >
>> >             By the way Romain, if you have specific scenarios in mind I
>> >             would love to hear them. I can try and guess what exactly
>> >             you would like to get out of schemas, but it would work
>> >             better if you gave me concrete scenarios that you would like
>> >             to work.
>> >
>> >             Reuven
>> >
>> >             On Tue, May 22, 2018 at 7:45 PM Reuven Lax <
>> relax@google.com
>> >             <ma...@google.com>> wrote:
>> >
>> >                 Yeah, what I'm working on will help with IO. Basically
>> >                 if you register a function with SchemaRegistry that
>> >                 converts back and forth between a type (say JsonObject)
>> >                 and a Beam Row, then it is applied by the framework
>> >                 behind the scenes as part of DoFn invocation. Concrete
>> >                 example: let's say I have an IO that reads json objects
>> >                   class MyJsonIORead extends PTransform<PBegin,
>> >                 JsonObject> {...}
>> >
>> >                 If you register a schema for this type (or you can also
>> >                 just set the schema directly on the output PCollection),
>> >                 then Beam knows how to convert back and forth between
>> >                 JsonObject and Row. So the next ParDo can look like
>> >
>> >                 p.apply(new MyJsonIORead())
>> >                 .apply(ParDo.of(new DoFn<JsonObject, T>....
>> >                     @ProcessElement void process(@Element Row row) {
>> >                    })
>> >
>> >                 And Beam will automatically convert JsonObject to a Row
>> >                 for processing (you aren't forced to do this of course -
>> >                 you can always ask for it as a JsonObject).
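
A rough sketch of such a registration, assuming a Pipeline object named
pipeline and a toy two-field schema. The registerSchemaForClass call and the
shape of the two conversion lambdas are assumptions about the registry API;
the Schema and Row builder calls follow the Beam API only approximately.

    import javax.json.Json;
    import javax.json.JsonObject;
    import org.apache.beam.sdk.schemas.Schema;
    import org.apache.beam.sdk.values.Row;

    Schema jsonSchema =
        Schema.builder().addStringField("name").addInt64Field("count").build();

    pipeline
        .getSchemaRegistry()
        .registerSchemaForClass(
            JsonObject.class,
            jsonSchema,
            json ->                                    // JsonObject -> Row
                Row.withSchema(jsonSchema)
                    .addValues(json.getString("name"), (long) json.getInt("count"))
                    .build(),
            row ->                                     // Row -> JsonObject
                Json.createObjectBuilder()
                    .add("name", row.getString("name"))
                    .add("count", row.getInt64("count"))
                    .build());

    // A downstream DoFn can then simply ask for the element as a Row:
    //   @ProcessElement
    //   public void process(@Element Row row) { ... }
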
>> >
>> >                 The same is true for output. If you have a sink that
>> >                 takes in JsonObject but the transform before it produces
>> >                 Row objects (for instance - because the transform before
>> >                 it is Beam SQL), Beam can automatically convert Row back
>> >                 to JsonObject for you.
>> >
>> >                 All of this was detailed in the Schema doc I shared a
>> >                 few months ago. There was a lot of discussion on that
>> >                 document from various parties, and some of this API is a
>> >                 result of that discussion. This is also working in the
>> >                 branch JB and I were working on, though not yet
>> >                 integrated back to master.
>> >
>> >                 I would like to actually go further and make Row an
>> >                 interface and provide a way to automatically put a Row
>> >                 interface on top of any other object (e.g. JsonObject,
>> >                 Pojo, etc.) This won't change the way the user writes
>> >                 code, but instead of Beam having to copy and convert at
>> >                 each stage (e.g. from JsonObject to Row) it simply will
>> >                 create a Row object that uses the the JsonObject as its
>> >                 underlying storage.
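
The zero-copy idea can be illustrated outside of Beam with a tiny row-like
view over a javax.json object. This shows only the wrapping technique; Beam's
actual Row is a concrete class today, so the interface below is purely
hypothetical.

    import javax.json.JsonObject;

    interface RowView {
      Object getValue(String fieldName);
    }

    // Reads straight through to the underlying JsonObject; no per-stage copy is made.
    class JsonObjectRowView implements RowView {
      private final JsonObject json;

      JsonObjectRowView(JsonObject json) {
        this.json = json;
      }

      @Override
      public Object getValue(String fieldName) {
        return json.get(fieldName);
      }
    }
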
>> >
>> >                 Reuven
>> >
>> >                 On Tue, May 22, 2018 at 11:37 AM Romain Manni-Bucau
>> >                 <rmannibucau@gmail.com <ma...@gmail.com>>
>> >                 wrote:
>> >
>> >                     Well, beam can implement a new mapper but it doesnt
>> >                     help for io. Most of modern backends will take json
>> >                     directly, even javax one and it must stay generic.
>> >
>> >                     Then since json to pojo mapping is already done a
>> >                     dozen of times, not sure it is worth it for now.
>> >
>> >                     Le mar. 22 mai 2018 20:27, Reuven Lax
>> >                     <relax@google.com <ma...@google.com>> a
>> écrit :
>> >
>> >                         We can do even better btw. Building a
>> >                         SchemaRegistry where automatic conversions can
>> >                         be registered between schema and Java data
>> >                         types. With this the user won't even need a DoFn
>> >                         to do the conversion.
>> >
>> >                         On Tue, May 22, 2018, 10:13 AM Romain
>> >                         Manni-Bucau <rmannibucau@gmail.com
>> >                         <ma...@gmail.com>> wrote:
>> >
>> >                             Hi guys,
>> >
>> >                             Checked out what has been done on schema
>> >                             model and think it is acceptable - regarding
>> >                             the json debate -
>> >                             if
>> https://issues.apache.org/jira/browse/BEAM-4381
>> >                             can be fixed.
>> >
>> >                             High level, it is about providing a
>> >                             mainstream and not too impacting model OOTB
>> >                             and JSON seems the most valid option for
>> >                             now, at least for IO and some user
>> transforms.
>> >
>> >                             Wdyt?
>> >
>> >                             Le ven. 27 avr. 2018 18:36, Romain
>> >                             Manni-Bucau <rmannibucau@gmail.com
>> >                             <ma...@gmail.com>> a écrit :
>> >
>> >                                  Can give it a try end of may, sure.
>> >                                 (holidays and work constraints will make
>> >                                 it hard before).
>> >
>> >                                 Le 27 avr. 2018 18:26, "Anton Kedin"
>> >                                 <kedin@google.com
>> >                                 <ma...@google.com>> a écrit :
>> >
>> >                                     Romain,
>> >
>> >                                     I don't believe that JSON approach
>> >                                     was investigated very thoroughIy. I
>> >                                     mentioned few reasons which will
>> >                                     make it not the best choice my
>> >                                     opinion, but I may be wrong. Can you
>> >                                     put together a design doc or a
>> >                                     prototype?
>> >
>> >                                     Thank you,
>> >                                     Anton
>> >
>> >
>> >                                     On Thu, Apr 26, 2018 at 10:17 PM
>> >                                     Romain Manni-Bucau
>> >                                     <rmannibucau@gmail.com
>> >                                     <ma...@gmail.com>>
>> wrote:
>> >
>> >
>> >
>> >                                         Le 26 avr. 2018 23:13, "Anton
>> >                                         Kedin" <kedin@google.com
>> >                                         <ma...@google.com>> a
>> écrit :
>> >
>> >                                             BeamRecord (Row) has very
>> >                                             little in common with
>> >                                             JsonObject (I assume you're
>> >                                             talking about javax.json),
>> >                                             except maybe some
>> >                                             similarities of the API. Few
>> >                                             reasons why JsonObject
>> >                                             doesn't work:
>> >
>> >                                               * it is a Java EE API:
>> >                                                   o Beam SDK is not
>> >                                                     limited to Java.
>> >                                                     There are probably
>> >                                                     similar APIs for
>> >                                                     other languages but
>> >                                                     they might not
>> >                                                     necessarily carry
>> >                                                     the same semantics /
>> >                                                     APIs;
>> >
>> >
>> >                                         Not a big deal I think. At least
>> >                                         not a technical blocker.
>> >
>> >                                                   o It can change
>> >                                                     between Java
>> versions;
>> >
>> >                                         No, this is javaee ;).
>> >
>> >
>> >                                                   o Current Beam java
>> >                                                     implementation is an
>> >                                                     experimental feature
>> >                                                     to identify what's
>> >                                                     needed from such
>> >                                                     API, in the end we
>> >                                                     might end up with
>> >                                                     something similar to
>> >                                                     JsonObject API, but
>> >                                                     likely not
>> >
>> >
>> >                                         I dont get that point as a
>> blocker
>> >
>> >                                                   o ;
>> >                                               * represents JSON, which
>> >                                                 is not an API but an
>> >                                                 object notation:
>> >                                                   o it is defined as
>> >                                                     unicode string in a
>> >                                                     certain format. If
>> >                                                     you choose to adhere
>> >                                                     to ECMA-404, then it
>> >                                                     doesn't sound like
>> >                                                     JsonObject can
>> >                                                     represent an Avro
>> >                                                     object, if I'm
>> >                                                     reading it right;
>> >
>> >
>> >                                         It is in the generator impl, you
>> >                                         can impl an avrogenerator.
>> >
>> >                                               * doesn't define a type
>> >                                                 system (JSON does, but
>> >                                                 it's lacking):
>> >                                                   o for example, JSON
>> >                                                     doesn't define
>> >                                                     semantics for
>> numbers;
>> >                                                   o doesn't define
>> >                                                     date/time types;
>> >                                                   o doesn't allow
>> >                                                     extending JSON type
>> >                                                     system at all;
>> >
>> >
>> >                                         That is why you need a metada
>> >                                         object, or simpler, a schema
>> >                                         with that data. Json or beam
>> >                                         record doesnt help here and you
>> >                                         end up on the same outcome if
>> >                                         you think about it.
>> >
>> >                                               * lacks schemas;
>> >
>> >                                         Jsonschema are standard, widely
>> >                                         spread and tooled compared to
>> >                                         alternative.
>> >
>> >                                             You can definitely try
>> >                                             loosen the requirements and
>> >                                             define everything in JSON in
>> >                                             userland, but the point of
>> >                                             Row/Schema is to avoid it
>> >                                             and define everything in
>> >                                             Beam model, which can be
>> >                                             extended, mapped to JSON,
>> >                                             Avro, BigQuery Schemas,
>> >                                             custom binary format etc.,
>> >                                             with same semantics across
>> >                                             beam SDKs.
>> >
>> >
>> >                                         This is what jsonp would allow
>> >                                         with the benefit of a natural
>> >                                         pojo support through jsonb.
>> >
>> >
>> >
>> >                                             On Thu, Apr 26, 2018 at
>> >                                             12:28 PM Romain Manni-Bucau
>> >                                             <rmannibucau@gmail.com
>> >                                             <mailto:
>> rmannibucau@gmail.com>>
>> >                                             wrote:
>> >
>> >                                                 Just to let it be clear
>> >                                                 and let me understand:
>> >                                                 how is BeamRecord
>> >                                                 different from a
>> >                                                 JsonObject which is an
>> >                                                 API without
>> >                                                 implementation (not
>> >                                                 event a json one OOTB)?
>> >                                                 Advantage of json *api*
>> >                                                 are indeed natural
>> >                                                 mapping (jsonb is based
>> >                                                 on jsonp so no new
>> >                                                 binding to reinvent) and
>> >                                                 simple serialization
>> >                                                 (json+gzip for ex, or
>> >                                                 avro if you want to be
>> >                                                 geeky).
>> >
>> >                                                 I fail to see the point
>> >                                                 to rebuild an ecosystem
>> ATM.
>> >
>> >                                                 Le 26 avr. 2018 19:12,
>> >                                                 "Reuven Lax"
>> >                                                 <relax@google.com
>> >                                                 <mailto:
>> relax@google.com>>
>> >                                                 a écrit :
>> >
>> >                                                     Exactly what JB
>> >                                                     said. We will write
>> >                                                     a generic conversion
>> >                                                     from Avro (or json)
>> >                                                     to Beam schemas,
>> >                                                     which will make them
>> >                                                     work transparently
>> >                                                     with SQL. The plan
>> >                                                     is also to migrate
>> >                                                     Anton's work so that
>> >                                                     POJOs works
>> >                                                     generically for any
>> >                                                     schema.
>> >
>> >                                                     Reuven
>> >
>> >                                                     On Thu, Apr 26, 2018
>> >                                                     at 1:17 AM
>> >                                                     Jean-Baptiste Onofré
>> >                                                     <jb@nanthrax.net
>> >                                                     <mailto:
>> jb@nanthrax.net>>
>> >                                                     wrote:
>> >
>> >                                                         For now we have
>> >                                                         a generic schema
>> >                                                         interface.
>> >                                                         Json-b can be an
>> >                                                         impl, avro could
>> >                                                         be another one.
>> >
>> >                                                         Regards
>> >                                                         JB
>> >         Le 26 avr. 2018, à 12:08, Romain Manni-Bucau
>> >         <rmannibucau@gmail.com> a écrit:
>> >
>> >             Hmm,
>> >
>> >             avro has still the pitfalls to have an uncontrolled stack
>> >             which brings way too much dependencies to be part of any
>> >             API, this is why I proposed a JSON-P based API (JsonObject)
>> >             with a custom beam entry for some metadata (headers "à la
>> >             Camel").
>> >
>> >             Romain Manni-Bucau
>> >             @rmannibucau <https://twitter.com/rmannibucau> | Blog
>> >             <https://rmannibucau.metawerx.net/> | Old Blog
>> >             <http://rmannibucau.wordpress.com> | Github
>> >             <https://github.com/rmannibucau> | LinkedIn
>> >             <https://www.linkedin.com/in/rmannibucau> | Book
>> >             <https://www.packtpub.com/application-development/java-ee-8-high-performance>
>> >
>> >             2018-04-26 9:59 GMT+02:00 Jean-Baptiste Onofré <jb@nanthrax.net>:
>> >
>> >                 Hi Ismael
>> >
>> >                 You mean directly in Beam SQL ?
>> >
>> >                 That will be part of schema support: generic record
>> >                 could be one of the payload with across schema.
>> >
>> >                 Regards
>> >                 JB
>> >
>> >                 Le 26 avr. 2018, à 11:39, "Ismaël Mejía"
>> >                 <iemejia@gmail.com> a écrit:
>> >
>> >                     Hello Anton,
>> >
>> >                     Thanks for the descriptive email and the really
>> >                     useful work. Any plans to tackle PCollections of
>> >                     GenericRecord/IndexedRecords? it seems Avro is a
>> >                     natural fit for this approach too.
>> >
>> >                     Regards,
>> >                     Ismaël
>> >
>> >                     On Wed, Apr 25, 2018 at 9:04 PM, Anton Kedin
>> >                     <kedin@google.com> wrote:
>> >
>> >                         Hi,
>> >
>> >                         I want to highlight a couple of improvements to
>> >                         Beam SQL we have been working on recently which
>> >                         are targeted to make Beam SQL API easier to use.
>> >                         Specifically these features simplify conversion
>> >                         of Java Beans and JSON strings to Rows.
>> >
>> >                         Feel free to try this and send any
>> >                         bugs/comments/PRs my way.
>> >
>> >                         **Caveat: this is still work in progress, and
>> >                         has known bugs and incomplete features, see
>> >                         below for details.**
>> >
>> >                         Background
>> >
>> >                         Beam SQL queries can only be applied to
>> >                         PCollection<Row>. This means that users need to
>> >                         convert whatever PCollection elements they have
>> >                         to Rows before querying them with SQL. This
>> >                         usually requires
>
>

Re: Beam SQL Improvements

Posted by Romain Manni-Bucau <rm...@gmail.com>.
Le mer. 23 mai 2018 07:55, Jean-Baptiste Onofré <jb...@nanthrax.net> a écrit :

> Hi,
>
> IMHO, it would be better to have an explicit transform/IO as converter.
>
> It would be easier for users.
>
> Another option would be to use a "TypeConverter/SchemaConverter" map as
> we do in Camel: Beam could check the source/destination "type" and check
> in the map if there's a converter available. This map can be stored as
> part of the pipeline (as we do for filesystem registration).
>


It works in Camel because it is not strongly typed, isn't it? So it can
require a new Beam pipeline API.

+1 for the explicit transform; if it is added to the pipeline API the way a
coder is, it wouldn't break the fluent API:

p.apply(io).setOutputType(Foo.class)

Coders can be a workaround since they own the type, but since the
PCollection is the real owner it is surely saner this way, no?

It also needs to ensure all converters are present before running the
pipeline; starting with no implicit environment converter support is
probably good, to avoid late surprises.
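
A minimal sketch of what such an explicit conversion step could look like,
expressed as a dedicated transform rather than the setOutputType call above.
ConvertToType and the RowConverters lookup are hypothetical names, not
existing Beam API, and a real implementation would cache the converter per
the lifecycle concern raised earlier in the thread.

    import org.apache.beam.sdk.transforms.MapElements;
    import org.apache.beam.sdk.transforms.PTransform;
    import org.apache.beam.sdk.values.PCollection;
    import org.apache.beam.sdk.values.Row;
    import org.apache.beam.sdk.values.TypeDescriptor;

    // Hypothetical explicit Row -> T step backed by a registered converter.
    public class ConvertToType<T> extends PTransform<PCollection<Row>, PCollection<T>> {
      private final Class<T> type;

      public ConvertToType(Class<T> type) {
        this.type = type;
      }

      @Override
      public PCollection<T> expand(PCollection<Row> input) {
        return input.apply(
            MapElements.into(TypeDescriptor.of(type))
                .via((Row row) -> RowConverters.lookup(type).fromRow(row)));
      }
    }

    // Usage keeps the fluent style of the pipeline:
    //   p.apply(someSqlTransform)                // produces PCollection<Row>
    //    .apply(new ConvertToType<>(Foo.class))  // explicit conversion step
    //    .apply(someSink);
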



> My $0.01
>
> Regards
> JB
>

Re: Beam SQL Improvements

Posted by Jean-Baptiste Onofré <jb...@nanthrax.net>.
Hi,

IMHO, it would be better to have an explicit transform/IO as converter.

It would be easier for users.

Another option would be to use a "TypeConverter/SchemaConverter" map as
we do in Camel: Beam could check the source/destination "type" and check
in the map if there's a converter available. This map can be stored as
part of the pipeline (as we do for filesystem registration).
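
A minimal sketch of that converter map, assuming a registry keyed by source
and destination type that the pipeline could carry around, much like the
filesystem registration mentioned above. SchemaConverter and ConverterRegistry
are illustrative names only, not Beam API.

    import java.io.Serializable;
    import java.util.HashMap;
    import java.util.Map;

    interface SchemaConverter<A, B> extends Serializable {
      B convert(A input);
    }

    class ConverterRegistry implements Serializable {
      private final Map<String, SchemaConverter<?, ?>> converters = new HashMap<>();

      <A, B> void register(Class<A> from, Class<B> to, SchemaConverter<A, B> converter) {
        converters.put(key(from, to), converter);
      }

      @SuppressWarnings("unchecked")
      <A, B> SchemaConverter<A, B> lookup(Class<A> from, Class<B> to) {
        // A null result could fail the pipeline at construction time,
        // before anything runs, if a needed converter is missing.
        return (SchemaConverter<A, B>) converters.get(key(from, to));
      }

      private static String key(Class<?> from, Class<?> to) {
        return from.getName() + "->" + to.getName();
      }
    }
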

My $0.01

Regards
JB

On 23/05/2018 07:51, Romain Manni-Bucau wrote:
> How does it work on the pipeline side?
> Do you generate these "virtual" IO at build time to enable the fluent
> API to work not erasing generics?
> 
> ex: SQL(row)->BigQuery(native) will not compile so we need a
> SQL(row)->BigQuery(row)
> 
> Side note unrelated to Row: if you add another registry maybe a pretask
> is to ensure beam has a kind of singleton/context to avoid to duplicate
> it or not track it properly. These kind of converters will need a global
> close and not only per record in general:
> converter.init();converter.convert(row);....converter.destroy();,
> otherwise it easily leaks. This is why it can require some way to not
> recreate it. A quick fix, if you are in bytebuddy already, can be to add
> it to setup/teardown pby, being more global would be nicer but is more
> challenging.
> 
> Romain Manni-Bucau
> @rmannibucau <https://twitter.com/rmannibucau> |  Blog
> <https://rmannibucau.metawerx.net/> | Old Blog
> <http://rmannibucau.wordpress.com> | Github
> <https://github.com/rmannibucau> | LinkedIn
> <https://www.linkedin.com/in/rmannibucau> | Book
> <https://www.packtpub.com/application-development/java-ee-8-high-performance>
> 
> 
> Le mer. 23 mai 2018 à 07:22, Reuven Lax <relax@google.com
> <ma...@google.com>> a écrit :
> 
>     No - the only modules we need to add to core are the ones we choose
>     to add. For example, I will probably add a registration for
>     TableRow/TableSchema (GCP BigQuery) so these can work seamlessly
>     with schemas. However I will add that to the GCP module, so only
>     someone depending on that module need to pull in that dependency.
>     The Java ServiceLoader framework can be used by these modules to
>     register schemas for their types (we already do something similar
>     for FileSystem and for coders as well).
> 
>     BTW, right now I'm doing the conversion back and forth between Row
>     objects in the ByteBuddy-generated bytecode that we generate in order
>     to invoke DoFns.
> 
>     Reuven
> 
>     On Tue, May 22, 2018 at 10:04 PM Romain Manni-Bucau
>     <rmannibucau@gmail.com <ma...@gmail.com>> wrote:
> 
>         Hmm, the pluggability part is close to what I wanted to do with
>         JsonObject as a main API (to avoid to redo a "row" API and
>         schema API)
>         Row.as(Class<T>) sounds good but then, does it mean we'll get
>         beam-sdk-java-row-jsonobject like modules (I'm not against, just
>         trying to understand here)?
>         If so, how an IO can use as() with the type it expects? Doesnt
>         it lead to have a tons of  these modules at the end?
> 
>         Romain Manni-Bucau
>         @rmannibucau <https://twitter.com/rmannibucau> |  Blog
>         <https://rmannibucau.metawerx.net/> | Old Blog
>         <http://rmannibucau.wordpress.com> | Github
>         <https://github.com/rmannibucau> | LinkedIn
>         <https://www.linkedin.com/in/rmannibucau> | Book
>         <https://www.packtpub.com/application-development/java-ee-8-high-performance>
> 
> 
>         Le mer. 23 mai 2018 à 04:57, Reuven Lax <relax@google.com
>         <ma...@google.com>> a écrit :
> 
>             By the way Romain, if you have specific scenarios in mind I
>             would love to hear them. I can try and guess what exactly
>             you would like to get out of schemas, but it would work
>             better if you gave me concrete scenarios that you would like
>             to work.
> 
>             Reuven
> 
>             On Tue, May 22, 2018 at 7:45 PM Reuven Lax <relax@google.com
>             <ma...@google.com>> wrote:
> 
>                 Yeah, what I'm working on will help with IO. Basically
>                 if you register a function with SchemaRegistry that
>                 converts back and forth between a type (say JsonObject)
>                 and a Beam Row, then it is applied by the framework
>                 behind the scenes as part of DoFn invocation. Concrete
>                 example: let's say I have an IO that reads json objects
>                   class MyJsonIORead extends PTransform<PBegin, PCollection<JsonObject>> {...}
> 
>                 If you register a schema for this type (or you can also
>                 just set the schema directly on the output PCollection),
>                 then Beam knows how to convert back and forth between
>                 JsonObject and Row. So the next ParDo can look like
> 
>                 p.apply(new MyJsonIORead())
>                  .apply(ParDo.of(new DoFn<JsonObject, T>() {
>                    @ProcessElement
>                    public void process(@Element Row row) {
>                      ...
>                    }
>                  }));
> 
>                 And Beam will automatically convert JsonObject to a Row
>                 for processing (you aren't forced to do this of course -
>                 you can always ask for it as a JsonObject).
> 
>                 The same is true for output. If you have a sink that
>                 takes in JsonObject but the transform before it produces
>                 Row objects (for instance - because the transform before
>                 it is Beam SQL), Beam can automatically convert Row back
>                 to JsonObject for you. 
> 
>                 All of this was detailed in the Schema doc I shared a
>                 few months ago. There was a lot of discussion on that
>                 document from various parties, and some of this API is a
>                 result of that discussion. This is also working in the
>                 branch JB and I were working on, though not yet
>                 integrated back to master.
> 
>                 I would like to actually go further and make Row an
>                 interface and provide a way to automatically put a Row
>                 interface on top of any other object (e.g. JsonObject,
>                 Pojo, etc.) This won't change the way the user writes
>                 code, but instead of Beam having to copy and convert at
>                 each stage (e.g. from JsonObject to Row) it simply will
>                 create a Row object that uses the JsonObject as its
>                 underlying storage.
> 
>                 Reuven
> 
>                 On Tue, May 22, 2018 at 11:37 AM Romain Manni-Bucau
>                 <rmannibucau@gmail.com <ma...@gmail.com>>
>                 wrote:
> 
>                     Well, beam can implement a new mapper but it doesnt
>                     help for io. Most of modern backends will take json
>                     directly, even javax one and it must stay generic.
> 
>                     Then since json to pojo mapping is already done a
>                     dozen of times, not sure it is worth it for now.
> 
>                     Le mar. 22 mai 2018 20:27, Reuven Lax
>                     <relax@google.com <ma...@google.com>> a écrit :
> 
>                         We can do even better btw. Building a
>                         SchemaRegistry where automatic conversions can
>                         be registered between schema and Java data
>                         types. With this the user won't even need a DoFn
>                         to do the conversion. 
> 
>                         On Tue, May 22, 2018, 10:13 AM Romain
>                         Manni-Bucau <rmannibucau@gmail.com
>                         <ma...@gmail.com>> wrote:
> 
>                             Hi guys,
> 
>                             Checked out what has been done on schema
>                             model and think it is acceptable - regarding
>                             the json debate - 
>                             if https://issues.apache.org/jira/browse/BEAM-4381
>                             can be fixed.
> 
>                             High level, it is about providing a
>                             mainstream and not too impacting model OOTB
>                             and JSON seems the most valid option for
>                             now, at least for IO and some user transforms.
> 
>                             Wdyt?
> 
>                             Le ven. 27 avr. 2018 18:36, Romain
>                             Manni-Bucau <rmannibucau@gmail.com
>                             <ma...@gmail.com>> a écrit :
> 
>                                  Can give it a try end of may, sure.
>                                 (holidays and work constraints will make
>                                 it hard before).
> 
>                                 Le 27 avr. 2018 18:26, "Anton Kedin"
>                                 <kedin@google.com
>                                 <ma...@google.com>> a écrit :
> 
>                                     Romain,
> 
>                                     I don't believe that JSON approach
>                                     was investigated very thoroughIy. I
>                                     mentioned few reasons which will
>                                     make it not the best choice my
>                                     opinion, but I may be wrong. Can you
>                                     put together a design doc or a
>                                     prototype?
> 
>                                     Thank you,
>                                     Anton
> 
> 
>                                     On Thu, Apr 26, 2018 at 10:17 PM
>                                     Romain Manni-Bucau
>                                     <rmannibucau@gmail.com
>                                     <ma...@gmail.com>> wrote:
> 
> 
> 
>                                         Le 26 avr. 2018 23:13, "Anton
>                                         Kedin" <kedin@google.com
>                                         <ma...@google.com>> a écrit :
> 
>                                             BeamRecord (Row) has very
>                                             little in common with
>                                             JsonObject (I assume you're
>                                             talking about javax.json),
>                                             except maybe some
>                                             similarities of the API. Few
>                                             reasons why JsonObject
>                                             doesn't work:
> 
>                                               * it is a Java EE API:
>                                                   o Beam SDK is not
>                                                     limited to Java.
>                                                     There are probably
>                                                     similar APIs for
>                                                     other languages but
>                                                     they might not
>                                                     necessarily carry
>                                                     the same semantics /
>                                                     APIs;
> 
> 
>                                         Not a big deal I think. At least
>                                         not a technical blocker.
> 
>                                                   o It can change
>                                                     between Java versions;
> 
>                                         No, this is javaee ;).
> 
> 
>                                                   o Current Beam java
>                                                     implementation is an
>                                                     experimental feature
>                                                     to identify what's
>                                                     needed from such
>                                                     API, in the end we
>                                                     might end up with
>                                                     something similar to
>                                                     JsonObject API, but
>                                                     likely not
> 
> 
>                                         I dont get that point as a blocker
> 
>                                                   o ;
>                                               * represents JSON, which
>                                                 is not an API but an
>                                                 object notation:
>                                                   o it is defined as
>                                                     unicode string in a
>                                                     certain format. If
>                                                     you choose to adhere
>                                                     to ECMA-404, then it
>                                                     doesn't sound like
>                                                     JsonObject can
>                                                     represent an Avro
>                                                     object, if I'm
>                                                     reading it right;
> 
> 
>                                         It is in the generator impl, you
>                                         can impl an avrogenerator.
> 
>                                               * doesn't define a type
>                                                 system (JSON does, but
>                                                 it's lacking):
>                                                   o for example, JSON
>                                                     doesn't define
>                                                     semantics for numbers;
>                                                   o doesn't define
>                                                     date/time types;
>                                                   o doesn't allow
>                                                     extending JSON type
>                                                     system at all;
> 
> 
>                                         That is why you need a metada
>                                         object, or simpler, a schema
>                                         with that data. Json or beam
>                                         record doesnt help here and you
>                                         end up on the same outcome if
>                                         you think about it.
> 
>                                               * lacks schemas;
> 
>                                         Jsonschema are standard, widely
>                                         spread and tooled compared to
>                                         alternative.
> 
>                                             You can definitely try
>                                             loosen the requirements and
>                                             define everything in JSON in
>                                             userland, but the point of
>                                             Row/Schema is to avoid it
>                                             and define everything in
>                                             Beam model, which can be
>                                             extended, mapped to JSON,
>                                             Avro, BigQuery Schemas,
>                                             custom binary format etc.,
>                                             with same semantics across
>                                             beam SDKs.
> 
> 
>                                         This is what jsonp would allow
>                                         with the benefit of a natural
>                                         pojo support through jsonb.
> 
> 
> 
>                                             On Thu, Apr 26, 2018 at
>                                             12:28 PM Romain Manni-Bucau
>                                             <rmannibucau@gmail.com
>                                             <ma...@gmail.com>>
>                                             wrote:
> 
>                                                 Just to let it be clear
>                                                 and let me understand:
>                                                 how is BeamRecord
>                                                 different from a
>                                                 JsonObject which is an
>                                                 API without
>                                                 implementation (not
>                                                 event a json one OOTB)?
>                                                 Advantage of json *api*
>                                                 are indeed natural
>                                                 mapping (jsonb is based
>                                                 on jsonp so no new
>                                                 binding to reinvent) and
>                                                 simple serialization
>                                                 (json+gzip for ex, or
>                                                 avro if you want to be
>                                                 geeky).
> 
>                                                 I fail to see the point
>                                                 to rebuild an ecosystem ATM.
> 
>                                                 Le 26 avr. 2018 19:12,
>                                                 "Reuven Lax"
>                                                 <relax@google.com
>                                                 <ma...@google.com>>
>                                                 a écrit :
> 
>                                                     Exactly what JB
>                                                     said. We will write
>                                                     a generic conversion
>                                                     from Avro (or json)
>                                                     to Beam schemas,
>                                                     which will make them
>                                                     work transparently
>                                                     with SQL. The plan
>                                                     is also to migrate
>                                                     Anton's work so that
>                                                     POJOs works
>                                                     generically for any
>                                                     schema.
> 
>                                                     Reuven
> 
>                                                     On Thu, Apr 26, 2018
>                                                     at 1:17 AM
>                                                     Jean-Baptiste Onofré
>                                                     <jb@nanthrax.net
>                                                     <ma...@nanthrax.net>>
>                                                     wrote:
> 
>                                                         For now we have
>                                                         a generic schema
>                                                         interface.
>                                                         Json-b can be an
>                                                         impl, avro could
>                                                         be another one.
> 
>                                                         Regards
>                                                         JB
>                                                         Le 26 avr. 2018,
>                                                         à 12:08, Romain
>                                                         Manni-Bucau
>                                                         <rmannibucau@gmail.com
>                                                         <ma...@gmail.com>>
>                                                         a écrit:
> 
>                                                             Hmm,
> 
>                                                             avro has
>                                                             still the
>                                                             pitfalls to
>                                                             have an
>                                                             uncontrolled
>                                                             stack which
>                                                             brings way
>                                                             too much
>                                                             dependencies
>                                                             to be part
>                                                             of any API,
>                                                             this is why
>                                                             I proposed a
>                                                             JSON-P based
>                                                             API
>                                                             (JsonObject)
>                                                             with a
>                                                             custom beam
>                                                             entry for
>                                                             some
>                                                             metadata
>                                                             (headers "à
>                                                             la Camel").
> 
> 
>                                                             Romain
>                                                             Manni-Bucau
>                                                             @rmannibucau
>                                                             <https://twitter.com/rmannibucau>
>                                                             |   Blog
>                                                             <https://rmannibucau.metawerx.net/> |
>                                                             Old Blog
>                                                             <http://rmannibucau.wordpress.com>
>                                                             |  Github
>                                                             <https://github.com/rmannibucau> |
>                                                             LinkedIn
>                                                             <https://www.linkedin.com/in/rmannibucau> |
>                                                             Book
>                                                             <https://www.packtpub.com/application-development/java-ee-8-high-performance>
> 
> 
>                                                             2018-04-26
>                                                             9:59
>                                                             GMT+02:00
>                                                             Jean-Baptiste Onofré
>                                                             <jb@nanthrax.net
>                                                             <ma...@nanthrax.net>>:
> 
> 
>                                                                 Hi Ismael
> 
>                                                                 You mean
>                                                                 directly
>                                                                 in Beam
>                                                                 SQL ?
> 
>                                                                 That
>                                                                 will be
>                                                                 part of
>                                                                 schema
>                                                                 support:
>                                                                 generic
>                                                                 record
>                                                                 could be
>                                                                 one of
>                                                                 the
>                                                                 payload
>                                                                 with
>                                                                 across
>                                                                 schema.
> 
>                                                                 Regards
>                                                                 JB
>                                                                 Le 26
>                                                                 avr.
>                                                                 2018, à
>                                                                 11:39,
>                                                                 "Ismaël
>                                                                 Mejía" <
>                                                                 iemejia@gmail.com
>                                                                 <ma...@gmail.com>>
>                                                                 a écrit:
> 
>                                                                     Hello Anton,
> 
>                                                                     Thanks for the descriptive email and the really useful work. Any plans
>                                                                     to tackle PCollections of GenericRecord/IndexedRecords? it seems Avro
>                                                                     is a natural fit for this approach too.
> 
>                                                                     Regards,
>                                                                     Ismaël
> 
>                                                                     On Wed, Apr 25, 2018 at 9:04 PM, Anton Kedin <kedin@google.com
>                                                                     <ma...@google.com>> wrote:
> 
>        Hi,
>
>        I want to highlight a couple of improvements to Beam SQL we have been
>        working on recently which are targeted to make Beam SQL API easier to
>        use. Specifically these features simplify conversion of Java Beans and
>        JSON strings to Rows.
>
>        Feel free to try this and send any bugs/comments/PRs my way.
>
>        **Caveat: this is still work in progress, and has known bugs and
>        incomplete features, see below for details.**
>
>        Background
>
>        Beam SQL queries can only be applied to PCollection<Row>. This means
>        that users need to convert whatever PCollection elements they have to
>        Rows before querying them with SQL. This usually requires manually
>        creating a Schema and implementing a custom conversion
>        PTransform<PCollection<Element>, PCollection<Row>> (see Beam SQL Guide).
>
>        The improvements described here are an attempt to reduce this overhead
>        for few common cases, as a start.
>
>        Status
>
>        Introduced a InferredRowCoder to automatically generate rows from
>        beans. Removes the need to manually define a Schema and Row conversion
>        logic;
>
>        Introduced JsonToRow transform to automatically parse JSON objects to
>        Rows. Removes the need to manually implement a conversion logic;
>
>        This is still experimental work in progress, APIs will likely change;
>
>        There are known bugs/unsolved problems;
>
>        Java Beans
>
>        Introduced a coder which facilitates Rows generation from Java Beans.
>        Reduces the overhead to:
>
>            /** Some user-defined Java Bean */
>            class JavaBeanObject implements Serializable {
>                String getName() { ... }
>            }
>
>            // Obtain the objects:
>            PCollection<JavaBeanObject> javaBeans = ...;
>
>            // Convert to Rows and apply a SQL query:
>            PCollection<Row> queryResult =
>                javaBeans
>                    .setCoder(InferredRowCoder.ofSerializable(JavaBeanObject.class))
>                    .apply(BeamSql.query("SELECT name FROM PCOLLECTION"));
>
>        Notice, there is no more manual Schema definition or custom conversion
>        logic.
>
>        Links
>
>        example;
>        InferredRowCoder;
>        test;
>
>        JSON
>
>        Introduced JsonToRow transform. It is possible to query a
>        PCollection<String> that contains JSON objects like this:
>
>            // Assuming JSON objects look like this:
>            // { "type" : "foo", "size" : 333 }
>
>            // Define a Schema:
>            Schema jsonSchema =
>                Schema
>                    .builder()
>                    .addStringField("type")
>                    .addInt32Field("size")
>                    .build();
>
>            // Obtain PCollection of the objects in JSON format:
>            PCollection<String> jsonObjects = ...
>
>            // Convert to Rows and apply a SQL query:
>            PCollection<Row> queryResults =
>                jsonObjects
>                    .apply(JsonToRow.withSchema(jsonSchema))
>                    .apply(BeamSql.query("SELECT type, AVG(size) FROM PCOLLECTION GROUP BY type"));
>
>        Notice, JSON to Row conversion is done by JsonToRow transform. It is
>        currently required to supply a Schema.
>
>        Links
>
>        JsonToRow;
>        test/example;
>
>        Going Forward
>
>        fix bugs (BEAM-4163, BEAM-4161 ...)
>
>        implement more features (BEAM-4167, more types of objects);
>
>        wire this up with sources/sinks to further simplify SQL API;
>
>        Thank you,
>
>        Anton
>

Re: Beam SQL Improvements

Posted by Reuven Lax <re...@google.com>.
On Tue, May 22, 2018 at 10:51 PM Romain Manni-Bucau <rm...@gmail.com>
wrote:

> How does it work on the pipeline side?
> Do you generate these "virtual" IO at build time to enable the fluent API
> to work not erasing generics?
>

Yeah - so I've already added support for injected element parameters (I'm
going to send an email to dev and users to make sure everyone is aware of
it), and that will be in the next Beam release. Basically you can now write:

new DoFn<InputT, OutputT>() {
  @ProcessElement
  public void process(@Element InputT element, OutputReceiver<OutputT> output) {
    // element and output receiver are injected by the framework
  }
}

So there's almost no need for ProcessContext anymore (I would like to
eventually support side inputs as well, at which point the only reason to
keep ProcessContext around is backwards compatibility). Since process() is
not a virtual method, the "type checking" is done at pipeline construction
time instead of compile time.
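
For illustration, a minimal sketch of how this looks in a pipeline fragment
(the word-length example and names below are made up for illustration, not
code from the branch):

PCollection<String> words = ...;  // some upstream PCollection (fragment, not runnable as-is)
PCollection<Integer> lengths =
    words.apply(ParDo.of(
        new DoFn<String, Integer>() {
          @ProcessElement
          public void process(@Element String word, OutputReceiver<Integer> out) {
            // element and output receiver are injected; no ProcessContext needed
            out.output(word.length());
          }
        }));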


> ex: SQL(row)->BigQuery(native) will not compile so we need a
> SQL(row)->BigQuery(row)
>
> Side note unrelated to Row: if you add another registry maybe a pretask is
> to ensure beam has a kind of singleton/context to avoid to duplicate it or
> not track it properly. These kind of converters will need a global close
> and not only per record in general:
> converter.init();converter.convert(row);....converter.destroy();, otherwise
> it easily leaks. This is why it can require some way to not recreate it. A
> quick fix, if you are in bytebuddy already, can be to add it to
> setup/teardown pby, being more global would be nicer but is more
> challenging.
>

Right now I'm using Pipeline as the container, so the lifetime is the life
of the Pipeline. Do you think this is the wrong lifetime?
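
To make that concrete, a hypothetical sketch of pipeline-scoped registration
(the registry method names here are assumptions based on the SchemaRegistry
idea discussed in this thread, not a finalized API):

Pipeline p = Pipeline.create(options);

// Registered once; the conversion then lives as long as the Pipeline object.
p.getSchemaRegistry().registerSchemaForClass(
    JsonObject.class,   // user-facing type
    jsonObjectSchema,   // its Beam Schema
    toRowFn,            // SerializableFunction<JsonObject, Row>
    fromRowFn);         // SerializableFunction<Row, JsonObject>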


>
> Romain Manni-Bucau
> @rmannibucau <https://twitter.com/rmannibucau> |  Blog
> <https://rmannibucau.metawerx.net/> | Old Blog
> <http://rmannibucau.wordpress.com> | Github
> <https://github.com/rmannibucau> | LinkedIn
> <https://www.linkedin.com/in/rmannibucau> | Book
> <https://www.packtpub.com/application-development/java-ee-8-high-performance>
>
>
> Le mer. 23 mai 2018 à 07:22, Reuven Lax <re...@google.com> a écrit :
>
>> No - the only modules we need to add to core are the ones we choose to
>> add. For example, I will probably add a registration for
>> TableRow/TableSchema (GCP BigQuery) so these can work seamlessly with
>> schemas. However I will add that to the GCP module, so only someone
>> depending on that module need to pull in that dependency. The Java
>> ServiceLoader framework can be used by these modules to register schemas
>> for their types (we already do something similar for FileSystem and for
>> coders as well).
>>
>> BTW, right now the conversion back and forth between Row objects I'm
>> doing in the ByteBuddy generated bytecode that we generate in order to
>> invoke DoFns.
>>
>> Reuven
>>
>> On Tue, May 22, 2018 at 10:04 PM Romain Manni-Bucau <
>> rmannibucau@gmail.com> wrote:
>>
>>> Hmm, the pluggability part is close to what I wanted to do with
>>> JsonObject as a main API (to avoid to redo a "row" API and schema API)
>>> Row.as(Class<T>) sounds good but then, does it mean we'll get
>>> beam-sdk-java-row-jsonobject like modules (I'm not against, just trying to
>>> understand here)?
>>> If so, how an IO can use as() with the type it expects? Doesnt it lead
>>> to have a tons of  these modules at the end?
>>>
>>> Romain Manni-Bucau
>>> @rmannibucau <https://twitter.com/rmannibucau> |  Blog
>>> <https://rmannibucau.metawerx.net/> | Old Blog
>>> <http://rmannibucau.wordpress.com> | Github
>>> <https://github.com/rmannibucau> | LinkedIn
>>> <https://www.linkedin.com/in/rmannibucau> | Book
>>> <https://www.packtpub.com/application-development/java-ee-8-high-performance>
>>>
>>>
>>> Le mer. 23 mai 2018 à 04:57, Reuven Lax <re...@google.com> a écrit :
>>>
>>>> By the way Romain, if you have specific scenarios in mind I would love
>>>> to hear them. I can try and guess what exactly you would like to get out of
>>>> schemas, but it would work better if you gave me concrete scenarios that
>>>> you would like to work.
>>>>
>>>> Reuven
>>>>
>>>> On Tue, May 22, 2018 at 7:45 PM Reuven Lax <re...@google.com> wrote:
>>>>
>>>>> Yeah, what I'm working on will help with IO. Basically if you register
>>>>> a function with SchemaRegistry that converts back and forth between a type
>>>>> (say JsonObject) and a Beam Row, then it is applied by the framework behind
>>>>> the scenes as part of DoFn invocation. Concrete example: let's say I have
>>>>> an IO that reads json objects
>>>>>   class MyJsonIORead extends PTransform<PBegin, JsonObject> {...}
>>>>>
>>>>> If you register a schema for this type (or you can also just set the
>>>>> schema directly on the output PCollection), then Beam knows how to convert
>>>>> back and forth between JsonObject and Row. So the next ParDo can look like
>>>>>
>>>>> p.apply(new MyJsonIORead())
>>>>> .apply(ParDo.of(new DoFn<JsonObject, T>....
>>>>>     @ProcessElement void process(@Element Row row) {
>>>>>    })
>>>>>
>>>>> And Beam will automatically convert JsonObject to a Row for processing
>>>>> (you aren't forced to do this of course - you can always ask for it as a
>>>>> JsonObject).
>>>>>
>>>>> The same is true for output. If you have a sink that takes in
>>>>> JsonObject but the transform before it produces Row objects (for instance -
>>>>> because the transform before it is Beam SQL), Beam can automatically
>>>>> convert Row back to JsonObject for you.
>>>>>
>>>>> All of this was detailed in the Schema doc I shared a few months ago.
>>>>> There was a lot of discussion on that document from various parties, and
>>>>> some of this API is a result of that discussion. This is also working in
>>>>> the branch JB and I were working on, though not yet integrated back to
>>>>> master.
>>>>>
>>>>> I would like to actually go further and make Row an interface and
>>>>> provide a way to automatically put a Row interface on top of any other
>>>>> object (e.g. JsonObject, Pojo, etc.) This won't change the way the user
>>>>> writes code, but instead of Beam having to copy and convert at each stage
>>>>> (e.g. from JsonObject to Row) it simply will create a Row object that uses
>>>>> the the JsonObject as its underlying storage.
>>>>>
>>>>> Reuven
>>>>>
>>>>> On Tue, May 22, 2018 at 11:37 AM Romain Manni-Bucau <
>>>>> rmannibucau@gmail.com> wrote:
>>>>>
>>>>>> Well, beam can implement a new mapper but it doesnt help for io. Most
>>>>>> of modern backends will take json directly, even javax one and it must stay
>>>>>> generic.
>>>>>>
>>>>>> Then since json to pojo mapping is already done a dozen of times, not
>>>>>> sure it is worth it for now.
>>>>>>
>>>>>> Le mar. 22 mai 2018 20:27, Reuven Lax <re...@google.com> a écrit :
>>>>>>
>>>>>>> We can do even better btw. Building a SchemaRegistry where automatic
>>>>>>> conversions can be registered between schema and Java data types. With this
>>>>>>> the user won't even need a DoFn to do the conversion.
>>>>>>>
>>>>>>> On Tue, May 22, 2018, 10:13 AM Romain Manni-Bucau <
>>>>>>> rmannibucau@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi guys,
>>>>>>>>
>>>>>>>> Checked out what has been done on schema model and think it is
>>>>>>>> acceptable - regarding the json debate -  if
>>>>>>>> https://issues.apache.org/jira/browse/BEAM-4381 can be fixed.
>>>>>>>>
>>>>>>>> High level, it is about providing a mainstream and not too
>>>>>>>> impacting model OOTB and JSON seems the most valid option for now, at least
>>>>>>>> for IO and some user transforms.
>>>>>>>>
>>>>>>>> Wdyt?
>>>>>>>>
>>>>>>>> Le ven. 27 avr. 2018 18:36, Romain Manni-Bucau <
>>>>>>>> rmannibucau@gmail.com> a écrit :
>>>>>>>>
>>>>>>>>>  Can give it a try end of may, sure. (holidays and work
>>>>>>>>> constraints will make it hard before).
>>>>>>>>>
>>>>>>>>> Le 27 avr. 2018 18:26, "Anton Kedin" <ke...@google.com> a écrit :
>>>>>>>>>
>>>>>>>>>> Romain,
>>>>>>>>>>
>>>>>>>>>> I don't believe that JSON approach was investigated very
>>>>>>>>>> thoroughIy. I mentioned few reasons which will make it not the best choice
>>>>>>>>>> my opinion, but I may be wrong. Can you put together a design doc or a
>>>>>>>>>> prototype?
>>>>>>>>>>
>>>>>>>>>> Thank you,
>>>>>>>>>> Anton
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Thu, Apr 26, 2018 at 10:17 PM Romain Manni-Bucau <
>>>>>>>>>> rmannibucau@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Le 26 avr. 2018 23:13, "Anton Kedin" <ke...@google.com> a
>>>>>>>>>>> écrit :
>>>>>>>>>>>
>>>>>>>>>>> BeamRecord (Row) has very little in common with JsonObject (I
>>>>>>>>>>> assume you're talking about javax.json), except maybe some similarities of
>>>>>>>>>>> the API. Few reasons why JsonObject doesn't work:
>>>>>>>>>>>
>>>>>>>>>>>    - it is a Java EE API:
>>>>>>>>>>>       - Beam SDK is not limited to Java. There are probably
>>>>>>>>>>>       similar APIs for other languages but they might not necessarily carry the
>>>>>>>>>>>       same semantics / APIs;
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Not a big deal I think. At least not a technical blocker.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>    - It can change between Java versions;
>>>>>>>>>>>
>>>>>>>>>>> No, this is javaee ;).
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>    - Current Beam java implementation is an experimental
>>>>>>>>>>>       feature to identify what's needed from such API, in the end we might end up
>>>>>>>>>>>       with something similar to JsonObject API, but likely not
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> I dont get that point as a blocker
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>    - ;
>>>>>>>>>>>       - represents JSON, which is not an API but an object
>>>>>>>>>>>    notation:
>>>>>>>>>>>       - it is defined as unicode string in a certain format. If
>>>>>>>>>>>       you choose to adhere to ECMA-404, then it doesn't sound like JsonObject can
>>>>>>>>>>>       represent an Avro object, if I'm reading it right;
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> It is in the generator impl, you can impl an avrogenerator.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>    - doesn't define a type system (JSON does, but it's lacking):
>>>>>>>>>>>       - for example, JSON doesn't define semantics for numbers;
>>>>>>>>>>>       - doesn't define date/time types;
>>>>>>>>>>>       - doesn't allow extending JSON type system at all;
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> That is why you need a metada object, or simpler, a schema with
>>>>>>>>>>> that data. Json or beam record doesnt help here and you end up on the same
>>>>>>>>>>> outcome if you think about it.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>    - lacks schemas;
>>>>>>>>>>>
>>>>>>>>>>> Jsonschema are standard, widely spread and tooled compared to
>>>>>>>>>>> alternative.
>>>>>>>>>>>
>>>>>>>>>>> You can definitely try loosen the requirements and define
>>>>>>>>>>> everything in JSON in userland, but the point of Row/Schema is to avoid it
>>>>>>>>>>> and define everything in Beam model, which can be extended, mapped to JSON,
>>>>>>>>>>> Avro, BigQuery Schemas, custom binary format etc., with same semantics
>>>>>>>>>>> across beam SDKs.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> This is what jsonp would allow with the benefit of a natural
>>>>>>>>>>> pojo support through jsonb.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Apr 26, 2018 at 12:28 PM Romain Manni-Bucau <
>>>>>>>>>>> rmannibucau@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Just to let it be clear and let me understand: how is
>>>>>>>>>>>> BeamRecord different from a JsonObject which is an API without
>>>>>>>>>>>> implementation (not event a json one OOTB)? Advantage of json *api* are
>>>>>>>>>>>> indeed natural mapping (jsonb is based on jsonp so no new binding to
>>>>>>>>>>>> reinvent) and simple serialization (json+gzip for ex, or avro if you want
>>>>>>>>>>>> to be geeky).
>>>>>>>>>>>>
>>>>>>>>>>>> I fail to see the point to rebuild an ecosystem ATM.
>>>>>>>>>>>>
>>>>>>>>>>>> Le 26 avr. 2018 19:12, "Reuven Lax" <re...@google.com> a
>>>>>>>>>>>> écrit :
>>>>>>>>>>>>
>>>>>>>>>>>>> Exactly what JB said. We will write a generic conversion from
>>>>>>>>>>>>> Avro (or json) to Beam schemas, which will make them work transparently
>>>>>>>>>>>>> with SQL. The plan is also to migrate Anton's work so that POJOs works
>>>>>>>>>>>>> generically for any schema.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Reuven
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Thu, Apr 26, 2018 at 1:17 AM Jean-Baptiste Onofré <
>>>>>>>>>>>>> jb@nanthrax.net> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> For now we have a generic schema interface. Json-b can be an
>>>>>>>>>>>>>> impl, avro could be another one.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Regards
>>>>>>>>>>>>>> JB
>>>>>>>>>>>>>> Le 26 avr. 2018, à 12:08, Romain Manni-Bucau <
>>>>>>>>>>>>>> rmannibucau@gmail.com> a écrit:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hmm,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> avro has still the pitfalls to have an uncontrolled stack
>>>>>>>>>>>>>>> which brings way too much dependencies to be part of any API,
>>>>>>>>>>>>>>> this is why I proposed a JSON-P based API (JsonObject) with
>>>>>>>>>>>>>>> a custom beam entry for some metadata (headers "à la Camel").
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Romain Manni-Bucau
>>>>>>>>>>>>>>> @rmannibucau <https://twitter.com/rmannibucau> |   Blog
>>>>>>>>>>>>>>> <https://rmannibucau.metawerx.net/> | Old Blog
>>>>>>>>>>>>>>> <http://rmannibucau.wordpress.com> |  Github
>>>>>>>>>>>>>>> <https://github.com/rmannibucau> | LinkedIn
>>>>>>>>>>>>>>> <https://www.linkedin.com/in/rmannibucau> | Book
>>>>>>>>>>>>>>> <https://www.packtpub.com/application-development/java-ee-8-high-performance>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 2018-04-26 9:59 GMT+02:00 Jean-Baptiste Onofré <
>>>>>>>>>>>>>>> jb@nanthrax.net>:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi Ismael
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> You mean directly in Beam SQL ?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> That will be part of schema support: generic record could
>>>>>>>>>>>>>>>> be one of the payload with across schema.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Regards
>>>>>>>>>>>>>>>> JB
>>>>>>>>>>>>>>>> Le 26 avr. 2018, à 11:39, "Ismaël Mejía" <
>>>>>>>>>>>>>>>> iemejia@gmail.com> a écrit:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hello Anton,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Thanks for the descriptive email and the really useful work. Any plans
>>>>>>>>>>>>>>>>> to tackle PCollections of GenericRecord/IndexedRecords? it seems Avro
>>>>>>>>>>>>>>>>> is a natural fit for this approach too.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>> Ismaël
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Wed, Apr 25, 2018 at 9:04 PM, Anton Kedin <ke...@google.com> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>            Hi,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>  I want to highlight a couple of improvements to Beam SQL we have been
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>  working on recently which are targeted to make Beam SQL API easier to use.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>  Specifically these features simplify conversion of Java Beans and JSON
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>  strings to Rows.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>  Feel free to try this and send any bugs/comments/PRs my way.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>  **Caveat: this is still work in progress, and has known bugs and incomplete
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>  features, see below for details.**
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>  Background
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>  Beam SQL queries can only be applied to PCollection<Row>. This means that
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>  users need to convert whatever PCollection elements they have to Rows before
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>  querying them with SQL. This usually requires manually creating a Schema and
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>  implementing a custom conversion PTransform<PCollection<
>>>>>>>>>>>>>>>>>>           Element>,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>  PCollection<Row>> (see Beam SQL Guide).
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>  The improvements described here are an attempt to reduce this overhead for
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>  few common cases, as a start.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>  Status
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>  Introduced a InferredRowCoder to automatically generate rows from beans.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>  Removes the need to manually define a Schema and Row conversion logic;
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>  Introduced JsonToRow transform to automatically parse JSON objects to Rows.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>  Removes the need to manually implement a conversion logic;
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>  This is still experimental work in progress, APIs will likely change;
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>  There are known bugs/unsolved problems;
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>  Java Beans
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>  Introduced a coder which facilitates Rows generation from Java Beans.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>  Reduces the overhead to:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>             /** Some user-defined Java Bean */
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>  class JavaBeanObject implements Serializable {
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>  String getName() { ... }
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>  }
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>  // Obtain the objects:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>  PCollection<JavaBeanObject> javaBeans = ...;
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>  // Convert to Rows and apply a SQL query:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>  PCollection<Row> queryResult =
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>  javaBeans
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>  .setCoder(InferredRowCoder.
>>>>>>>>>>>>>>>>>>>            ofSerializable(JavaBeanObject.
>>>>>>>>>>>>>>>>>>>            class))
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>  .apply(BeamSql.query("SELECT name FROM PCOLLECTION"));
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>  Notice, there is no more manual Schema definition or custom conversion
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>  logic.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>  Links
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>   example;
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>   InferredRowCoder;
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>   test;
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>  JSON
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>  Introduced JsonToRow transform. It is possible to query a
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>  PCollection<String> that contains JSON objects like this:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>             // Assuming JSON objects look like this:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>  // { "type" : "foo", "size" : 333 }
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>  // Define a Schema:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>  Schema jsonSchema =
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>  Schema
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>  .builder()
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>  .addStringField("type")
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>  .addInt32Field("size")
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>  .build();
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>  // Obtain PCollection of the objects in JSON format:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>  PCollection<String> jsonObjects = ...
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>  // Convert to Rows and apply a SQL query:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>  PCollection<Row> queryResults =
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>  jsonObjects
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>  .apply(JsonToRow.withSchema(
>>>>>>>>>>>>>>>>>>>            jsonSchema))
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>  .apply(BeamSql.query("SELECT type, AVG(size) FROM PCOLLECTION GROUP BY
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>  type"));
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>  Notice, JSON to Row conversion is done by JsonToRow transform. It is
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>  currently required to supply a Schema.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>  Links
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>   JsonToRow;
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>   test/example;
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>  Going Forward
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>  fix bugs (BEAM-4163, BEAM-4161 ...)
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>  implement more features (BEAM-4167, more types of objects);
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>  wire this up with sources/sinks to further simplify SQL API;
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>  Thank you,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>  Anton
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>

Re: Beam SQL Improvements

Posted by Romain Manni-Bucau <rm...@gmail.com>.
How does it work on the pipeline side?
Do you generate these "virtual" IOs at build time to enable the fluent API
to work without erasing generics?

ex: SQL(row)->BigQuery(native) will not compile so we need a
SQL(row)->BigQuery(row)

Side note unrelated to Row: if you add another registry, a prerequisite is
probably to ensure Beam has a kind of singleton/context so the registry is
not duplicated or left untracked. These kinds of converters generally need a
global close, not only a per-record one:
converter.init(); converter.convert(row); ...; converter.destroy(); otherwise
it easily leaks. This is why some way to avoid recreating it can be required.
A quick fix, if you are in ByteBuddy already, can be to hook it into
setup/teardown; being more global would be nicer but is more challenging.
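
To illustrate the concern, a rough sketch of that lifecycle hooked into a
DoFn's setup/teardown (RowConverter here is a made-up type, not an existing
Beam API):

// Made-up converter type with an explicit lifecycle, to illustrate the leak concern.
class RowConverter implements Serializable {
  void init() { /* allocate heavyweight resources once */ }
  JsonObject convert(Row row) { return null; /* per-record conversion would go here */ }
  void destroy() { /* release resources */ }
}

class ConvertingFn extends DoFn<Row, JsonObject> {
  private transient RowConverter converter;

  @Setup
  public void setup() {
    converter = new RowConverter();
    converter.init();    // created once per DoFn instance, not once per record
  }

  @ProcessElement
  public void process(@Element Row row, OutputReceiver<JsonObject> out) {
    out.output(converter.convert(row));
  }

  @Teardown
  public void teardown() {
    converter.destroy(); // the "global close", so the converter does not leak
  }
}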

Romain Manni-Bucau
@rmannibucau <https://twitter.com/rmannibucau> |  Blog
<https://rmannibucau.metawerx.net/> | Old Blog
<http://rmannibucau.wordpress.com> | Github <https://github.com/rmannibucau> |
LinkedIn <https://www.linkedin.com/in/rmannibucau> | Book
<https://www.packtpub.com/application-development/java-ee-8-high-performance>


Le mer. 23 mai 2018 à 07:22, Reuven Lax <re...@google.com> a écrit :

> No - the only modules we need to add to core are the ones we choose to
> add. For example, I will probably add a registration for
> TableRow/TableSchema (GCP BigQuery) so these can work seamlessly with
> schemas. However I will add that to the GCP module, so only someone
> depending on that module need to pull in that dependency. The Java
> ServiceLoader framework can be used by these modules to register schemas
> for their types (we already do something similar for FileSystem and for
> coders as well).
>
> BTW, right now the conversion back and forth between Row objects I'm doing
> in the ByteBuddy generated bytecode that we generate in order to invoke
> DoFns.
>
> Reuven
>
> On Tue, May 22, 2018 at 10:04 PM Romain Manni-Bucau <rm...@gmail.com>
> wrote:
>
>> Hmm, the pluggability part is close to what I wanted to do with
>> JsonObject as a main API (to avoid to redo a "row" API and schema API)
>> Row.as(Class<T>) sounds good but then, does it mean we'll get
>> beam-sdk-java-row-jsonobject like modules (I'm not against, just trying to
>> understand here)?
>> If so, how an IO can use as() with the type it expects? Doesnt it lead to
>> have a tons of  these modules at the end?
>>
>> Romain Manni-Bucau
>> @rmannibucau <https://twitter.com/rmannibucau> |  Blog
>> <https://rmannibucau.metawerx.net/> | Old Blog
>> <http://rmannibucau.wordpress.com> | Github
>> <https://github.com/rmannibucau> | LinkedIn
>> <https://www.linkedin.com/in/rmannibucau> | Book
>> <https://www.packtpub.com/application-development/java-ee-8-high-performance>
>>
>>
>> Le mer. 23 mai 2018 à 04:57, Reuven Lax <re...@google.com> a écrit :
>>
>>> By the way Romain, if you have specific scenarios in mind I would love
>>> to hear them. I can try and guess what exactly you would like to get out of
>>> schemas, but it would work better if you gave me concrete scenarios that
>>> you would like to work.
>>>
>>> Reuven
>>>
>>> On Tue, May 22, 2018 at 7:45 PM Reuven Lax <re...@google.com> wrote:
>>>
>>>> Yeah, what I'm working on will help with IO. Basically if you register
>>>> a function with SchemaRegistry that converts back and forth between a type
>>>> (say JsonObject) and a Beam Row, then it is applied by the framework behind
>>>> the scenes as part of DoFn invocation. Concrete example: let's say I have
>>>> an IO that reads json objects
>>>>   class MyJsonIORead extends PTransform<PBegin, JsonObject> {...}
>>>>
>>>> If you register a schema for this type (or you can also just set the
>>>> schema directly on the output PCollection), then Beam knows how to convert
>>>> back and forth between JsonObject and Row. So the next ParDo can look like
>>>>
>>>> p.apply(new MyJsonIORead())
>>>> .apply(ParDo.of(new DoFn<JsonObject, T>....
>>>>     @ProcessElement void process(@Element Row row) {
>>>>    })
>>>>
>>>> And Beam will automatically convert JsonObject to a Row for processing
>>>> (you aren't forced to do this of course - you can always ask for it as a
>>>> JsonObject).
>>>>
>>>> The same is true for output. If you have a sink that takes in
>>>> JsonObject but the transform before it produces Row objects (for instance -
>>>> because the transform before it is Beam SQL), Beam can automatically
>>>> convert Row back to JsonObject for you.
>>>>
>>>> All of this was detailed in the Schema doc I shared a few months ago.
>>>> There was a lot of discussion on that document from various parties, and
>>>> some of this API is a result of that discussion. This is also working in
>>>> the branch JB and I were working on, though not yet integrated back to
>>>> master.
>>>>
>>>> I would like to actually go further and make Row an interface and
>>>> provide a way to automatically put a Row interface on top of any other
>>>> object (e.g. JsonObject, Pojo, etc.) This won't change the way the user
>>>> writes code, but instead of Beam having to copy and convert at each stage
>>>> (e.g. from JsonObject to Row) it simply will create a Row object that uses
>>>> the the JsonObject as its underlying storage.
>>>>
>>>> Reuven
>>>>
>>>> On Tue, May 22, 2018 at 11:37 AM Romain Manni-Bucau <
>>>> rmannibucau@gmail.com> wrote:
>>>>
>>>>> Well, beam can implement a new mapper but it doesnt help for io. Most
>>>>> of modern backends will take json directly, even javax one and it must stay
>>>>> generic.
>>>>>
>>>>> Then since json to pojo mapping is already done a dozen of times, not
>>>>> sure it is worth it for now.
>>>>>
>>>>> Le mar. 22 mai 2018 20:27, Reuven Lax <re...@google.com> a écrit :
>>>>>
>>>>>> We can do even better btw. Building a SchemaRegistry where automatic
>>>>>> conversions can be registered between schema and Java data types. With this
>>>>>> the user won't even need a DoFn to do the conversion.
>>>>>>
>>>>>> On Tue, May 22, 2018, 10:13 AM Romain Manni-Bucau <
>>>>>> rmannibucau@gmail.com> wrote:
>>>>>>
>>>>>>> Hi guys,
>>>>>>>
>>>>>>> Checked out what has been done on schema model and think it is
>>>>>>> acceptable - regarding the json debate -  if
>>>>>>> https://issues.apache.org/jira/browse/BEAM-4381 can be fixed.
>>>>>>>
>>>>>>> High level, it is about providing a mainstream and not too impacting
>>>>>>> model OOTB and JSON seems the most valid option for now, at least for IO
>>>>>>> and some user transforms.
>>>>>>>
>>>>>>> Wdyt?
>>>>>>>
>>>>>>> Le ven. 27 avr. 2018 18:36, Romain Manni-Bucau <
>>>>>>> rmannibucau@gmail.com> a écrit :
>>>>>>>
>>>>>>>>  Can give it a try end of may, sure. (holidays and work constraints
>>>>>>>> will make it hard before).
>>>>>>>>
>>>>>>>> Le 27 avr. 2018 18:26, "Anton Kedin" <ke...@google.com> a écrit :
>>>>>>>>
>>>>>>>>> Romain,
>>>>>>>>>
>>>>>>>>> I don't believe that JSON approach was investigated very
>>>>>>>>> thoroughIy. I mentioned few reasons which will make it not the best choice
>>>>>>>>> my opinion, but I may be wrong. Can you put together a design doc or a
>>>>>>>>> prototype?
>>>>>>>>>
>>>>>>>>> Thank you,
>>>>>>>>> Anton
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Thu, Apr 26, 2018 at 10:17 PM Romain Manni-Bucau <
>>>>>>>>> rmannibucau@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Le 26 avr. 2018 23:13, "Anton Kedin" <ke...@google.com> a écrit :
>>>>>>>>>>
>>>>>>>>>> BeamRecord (Row) has very little in common with JsonObject (I
>>>>>>>>>> assume you're talking about javax.json), except maybe some similarities of
>>>>>>>>>> the API. Few reasons why JsonObject doesn't work:
>>>>>>>>>>
>>>>>>>>>>    - it is a Java EE API:
>>>>>>>>>>       - Beam SDK is not limited to Java. There are probably
>>>>>>>>>>       similar APIs for other languages but they might not necessarily carry the
>>>>>>>>>>       same semantics / APIs;
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Not a big deal I think. At least not a technical blocker.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>    - It can change between Java versions;
>>>>>>>>>>
>>>>>>>>>> No, this is javaee ;).
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>    - Current Beam java implementation is an experimental feature
>>>>>>>>>>       to identify what's needed from such API, in the end we might end up with
>>>>>>>>>>       something similar to JsonObject API, but likely not
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> I dont get that point as a blocker
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>    - ;
>>>>>>>>>>       - represents JSON, which is not an API but an object
>>>>>>>>>>    notation:
>>>>>>>>>>       - it is defined as unicode string in a certain format. If
>>>>>>>>>>       you choose to adhere to ECMA-404, then it doesn't sound like JsonObject can
>>>>>>>>>>       represent an Avro object, if I'm reading it right;
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> It is in the generator impl, you can impl an avrogenerator.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>    - doesn't define a type system (JSON does, but it's lacking):
>>>>>>>>>>       - for example, JSON doesn't define semantics for numbers;
>>>>>>>>>>       - doesn't define date/time types;
>>>>>>>>>>       - doesn't allow extending JSON type system at all;
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> That is why you need a metada object, or simpler, a schema with
>>>>>>>>>> that data. Json or beam record doesnt help here and you end up on the same
>>>>>>>>>> outcome if you think about it.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>    - lacks schemas;
>>>>>>>>>>
>>>>>>>>>> Jsonschema are standard, widely spread and tooled compared to
>>>>>>>>>> alternative.
>>>>>>>>>>
>>>>>>>>>> You can definitely try loosen the requirements and define
>>>>>>>>>> everything in JSON in userland, but the point of Row/Schema is to avoid it
>>>>>>>>>> and define everything in Beam model, which can be extended, mapped to JSON,
>>>>>>>>>> Avro, BigQuery Schemas, custom binary format etc., with same semantics
>>>>>>>>>> across beam SDKs.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> This is what jsonp would allow with the benefit of a natural pojo
>>>>>>>>>> support through jsonb.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Thu, Apr 26, 2018 at 12:28 PM Romain Manni-Bucau <
>>>>>>>>>> rmannibucau@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Just to let it be clear and let me understand: how is BeamRecord
>>>>>>>>>>> different from a JsonObject which is an API without implementation (not
>>>>>>>>>>> event a json one OOTB)? Advantage of json *api* are indeed natural mapping
>>>>>>>>>>> (jsonb is based on jsonp so no new binding to reinvent) and simple
>>>>>>>>>>> serialization (json+gzip for ex, or avro if you want to be geeky).
>>>>>>>>>>>
>>>>>>>>>>> I fail to see the point to rebuild an ecosystem ATM.
>>>>>>>>>>>
>>>>>>>>>>> Le 26 avr. 2018 19:12, "Reuven Lax" <re...@google.com> a écrit :
>>>>>>>>>>>
>>>>>>>>>>>> Exactly what JB said. We will write a generic conversion from
>>>>>>>>>>>> Avro (or json) to Beam schemas, which will make them work transparently
>>>>>>>>>>>> with SQL. The plan is also to migrate Anton's work so that POJOs works
>>>>>>>>>>>> generically for any schema.
>>>>>>>>>>>>
>>>>>>>>>>>> Reuven
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, Apr 26, 2018 at 1:17 AM Jean-Baptiste Onofré <
>>>>>>>>>>>> jb@nanthrax.net> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> For now we have a generic schema interface. Json-b can be an
>>>>>>>>>>>>> impl, avro could be another one.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Regards
>>>>>>>>>>>>> JB
>>>>>>>>>>>>> Le 26 avr. 2018, à 12:08, Romain Manni-Bucau <
>>>>>>>>>>>>> rmannibucau@gmail.com> a écrit:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hmm,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> avro has still the pitfalls to have an uncontrolled stack
>>>>>>>>>>>>>> which brings way too much dependencies to be part of any API,
>>>>>>>>>>>>>> this is why I proposed a JSON-P based API (JsonObject) with a
>>>>>>>>>>>>>> custom beam entry for some metadata (headers "à la Camel").
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Romain Manni-Bucau
>>>>>>>>>>>>>> @rmannibucau <https://twitter.com/rmannibucau> |   Blog
>>>>>>>>>>>>>> <https://rmannibucau.metawerx.net/> | Old Blog
>>>>>>>>>>>>>> <http://rmannibucau.wordpress.com> |  Github
>>>>>>>>>>>>>> <https://github.com/rmannibucau> | LinkedIn
>>>>>>>>>>>>>> <https://www.linkedin.com/in/rmannibucau> | Book
>>>>>>>>>>>>>> <https://www.packtpub.com/application-development/java-ee-8-high-performance>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 2018-04-26 9:59 GMT+02:00 Jean-Baptiste Onofré <
>>>>>>>>>>>>>> jb@nanthrax.net>:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi Ismael
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> You mean directly in Beam SQL ?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> That will be part of schema support: generic record could be
>>>>>>>>>>>>>>> one of the payload with across schema.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Regards
>>>>>>>>>>>>>>> JB
>>>>>>>>>>>>>>> Le 26 avr. 2018, à 11:39, "Ismaël Mejía" < iemejia@gmail.com>
>>>>>>>>>>>>>>> a écrit:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hello Anton,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks for the descriptive email and the really useful work. Any plans
>>>>>>>>>>>>>>>> to tackle PCollections of GenericRecord/IndexedRecords? it seems Avro
>>>>>>>>>>>>>>>> is a natural fit for this approach too.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>> Ismaël
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Wed, Apr 25, 2018 at 9:04 PM, Anton Kedin <ke...@google.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>            Hi,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>  I want to highlight a couple of improvements to Beam SQL we have been
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>  working on recently which are targeted to make Beam SQL API easier to use.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>  Specifically these features simplify conversion of Java Beans and JSON
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>  strings to Rows.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>  Feel free to try this and send any bugs/comments/PRs my way.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>  **Caveat: this is still work in progress, and has known bugs and incomplete
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>  features, see below for details.**
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>  Background
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>  Beam SQL queries can only be applied to PCollection<Row>. This means that
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>  users need to convert whatever PCollection elements they have to Rows before
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>  querying them with SQL. This usually requires manually creating a Schema and
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>  implementing a custom conversion PTransform<PCollection<
>>>>>>>>>>>>>>>>>           Element>,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>  PCollection<Row>> (see Beam SQL Guide).
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>  The improvements described here are an attempt to reduce this overhead for
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>  few common cases, as a start.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>  Status
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>  Introduced a InferredRowCoder to automatically generate rows from beans.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>  Removes the need to manually define a Schema and Row conversion logic;
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>  Introduced JsonToRow transform to automatically parse JSON objects to Rows.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>  Removes the need to manually implement a conversion logic;
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>  This is still experimental work in progress, APIs will likely change;
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>  There are known bugs/unsolved problems;
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>  Java Beans
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>  Introduced a coder which facilitates Rows generation from Java Beans.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>  Reduces the overhead to:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>             /** Some user-defined Java Bean */
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>  class JavaBeanObject implements Serializable {
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>  String getName() { ... }
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>  }
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>  // Obtain the objects:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>  PCollection<JavaBeanObject> javaBeans = ...;
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>  // Convert to Rows and apply a SQL query:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>  PCollection<Row> queryResult =
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>  javaBeans
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>  .setCoder(InferredRowCoder.
>>>>>>>>>>>>>>>>>>            ofSerializable(JavaBeanObject.
>>>>>>>>>>>>>>>>>>            class))
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>  .apply(BeamSql.query("SELECT name FROM PCOLLECTION"));
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>  Notice, there is no more manual Schema definition or custom conversion
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>  logic.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>  Links
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>   example;
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>   InferredRowCoder;
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>   test;
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>  JSON
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>  Introduced JsonToRow transform. It is possible to query a
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>  PCollection<String> that contains JSON objects like this:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>             // Assuming JSON objects look like this:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>  // { "type" : "foo", "size" : 333 }
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>  // Define a Schema:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>  Schema jsonSchema =
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>  Schema
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>  .builder()
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>  .addStringField("type")
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>  .addInt32Field("size")
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>  .build();
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>  // Obtain PCollection of the objects in JSON format:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>  PCollection<String> jsonObjects = ...
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>  // Convert to Rows and apply a SQL query:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>  PCollection<Row> queryResults =
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>  jsonObjects
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>  .apply(JsonToRow.withSchema(
>>>>>>>>>>>>>>>>>>            jsonSchema))
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>  .apply(BeamSql.query("SELECT type, AVG(size) FROM PCOLLECTION GROUP BY
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>  type"));
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>  Notice, JSON to Row conversion is done by JsonToRow transform. It is
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>  currently required to supply a Schema.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>  Links
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>   JsonToRow;
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>   test/example;
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>  Going Forward
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>  fix bugs (BEAM-4163, BEAM-4161 ...)
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>  implement more features (BEAM-4167, more types of objects);
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>  wire this up with sources/sinks to further simplify SQL API;
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>  Thank you,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>  Anton
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>

Re: Beam SQL Improvements

Posted by Reuven Lax <re...@google.com>.
No - the only modules we need to add to core are the ones we choose to add.
For example, I will probably add a registration for TableRow/TableSchema
(GCP BigQuery) so these can work seamlessly with schemas. However, I will
add that to the GCP module, so only someone depending on that module needs
to pull in that dependency. The Java ServiceLoader framework can be used by
these modules to register schemas for their types (we already do something
similar for FileSystem and for coders as well).
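
As a sketch of the ServiceLoader idea (the registrar and provider names below
are assumptions modeled on the existing FileSystemRegistrar /
CoderProviderRegistrar pattern, not a committed API):

// Lives in the GCP module, so only users of that module pick it up via ServiceLoader.
@AutoService(SchemaProviderRegistrar.class)
public class BigQuerySchemaRegistrar implements SchemaProviderRegistrar {
  @Override
  public List<SchemaProvider> getSchemaProviders() {
    // the provider would know how to map TableRow <-> Row using a TableSchema
    return Collections.singletonList(new TableRowSchemaProvider());
  }
}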

BTW, right now I'm doing the conversion back and forth between Row objects
in the ByteBuddy-generated bytecode that we generate in order to invoke
DoFns.

Reuven

On Tue, May 22, 2018 at 10:04 PM Romain Manni-Bucau <rm...@gmail.com>
wrote:

> Hmm, the pluggability part is close to what I wanted to do with JsonObject
> as a main API (to avoid to redo a "row" API and schema API)
> Row.as(Class<T>) sounds good but then, does it mean we'll get
> beam-sdk-java-row-jsonobject like modules (I'm not against, just trying to
> understand here)?
> If so, how an IO can use as() with the type it expects? Doesnt it lead to
> have a tons of  these modules at the end?
>
> Romain Manni-Bucau
> @rmannibucau <https://twitter.com/rmannibucau> |  Blog
> <https://rmannibucau.metawerx.net/> | Old Blog
> <http://rmannibucau.wordpress.com> | Github
> <https://github.com/rmannibucau> | LinkedIn
> <https://www.linkedin.com/in/rmannibucau> | Book
> <https://www.packtpub.com/application-development/java-ee-8-high-performance>
>
>
> Le mer. 23 mai 2018 à 04:57, Reuven Lax <re...@google.com> a écrit :
>
>> By the way Romain, if you have specific scenarios in mind I would love to
>> hear them. I can try and guess what exactly you would like to get out of
>> schemas, but it would work better if you gave me concrete scenarios that
>> you would like to work.
>>
>> Reuven
>>
>> On Tue, May 22, 2018 at 7:45 PM Reuven Lax <re...@google.com> wrote:
>>
>>> Yeah, what I'm working on will help with IO. Basically if you register a
>>> function with SchemaRegistry that converts back and forth between a type
>>> (say JsonObject) and a Beam Row, then it is applied by the framework behind
>>> the scenes as part of DoFn invocation. Concrete example: let's say I have
>>> an IO that reads json objects
>>>   class MyJsonIORead extends PTransform<PBegin, JsonObject> {...}
>>>
>>> If you register a schema for this type (or you can also just set the
>>> schema directly on the output PCollection), then Beam knows how to convert
>>> back and forth between JsonObject and Row. So the next ParDo can look like
>>>
>>> p.apply(new MyJsonIORead())
>>> .apply(ParDo.of(new DoFn<JsonObject, T>....
>>>     @ProcessElement void process(@Element Row row) {
>>>    })
>>>
>>> And Beam will automatically convert JsonObject to a Row for processing
>>> (you aren't forced to do this of course - you can always ask for it as a
>>> JsonObject).
>>>
>>> The same is true for output. If you have a sink that takes in JsonObject
>>> but the transform before it produces Row objects (for instance - because
>>> the transform before it is Beam SQL), Beam can automatically convert Row
>>> back to JsonObject for you.
>>>
>>> All of this was detailed in the Schema doc I shared a few months ago.
>>> There was a lot of discussion on that document from various parties, and
>>> some of this API is a result of that discussion. This is also working in
>>> the branch JB and I were working on, though not yet integrated back to
>>> master.
>>>
>>> I would like to actually go further and make Row an interface and
>>> provide a way to automatically put a Row interface on top of any other
>>> object (e.g. JsonObject, Pojo, etc.) This won't change the way the user
>>> writes code, but instead of Beam having to copy and convert at each stage
>>> (e.g. from JsonObject to Row) it simply will create a Row object that uses
>>> the the JsonObject as its underlying storage.
>>>
>>> Reuven
>>>
>>> On Tue, May 22, 2018 at 11:37 AM Romain Manni-Bucau <
>>> rmannibucau@gmail.com> wrote:
>>>
>>>> Well, beam can implement a new mapper but it doesnt help for io. Most
>>>> of modern backends will take json directly, even javax one and it must stay
>>>> generic.
>>>>
>>>> Then since json to pojo mapping is already done a dozen of times, not
>>>> sure it is worth it for now.
>>>>
>>>> Le mar. 22 mai 2018 20:27, Reuven Lax <re...@google.com> a écrit :
>>>>
>>>>> We can do even better btw. Building a SchemaRegistry where automatic
>>>>> conversions can be registered between schema and Java data types. With this
>>>>> the user won't even need a DoFn to do the conversion.
>>>>>
>>>>> On Tue, May 22, 2018, 10:13 AM Romain Manni-Bucau <
>>>>> rmannibucau@gmail.com> wrote:
>>>>>
>>>>>> Hi guys,
>>>>>>
>>>>>> Checked out what has been done on schema model and think it is
>>>>>> acceptable - regarding the json debate -  if
>>>>>> https://issues.apache.org/jira/browse/BEAM-4381 can be fixed.
>>>>>>
>>>>>> High level, it is about providing a mainstream and not too impacting
>>>>>> model OOTB and JSON seems the most valid option for now, at least for IO
>>>>>> and some user transforms.
>>>>>>
>>>>>> Wdyt?
>>>>>>
>>>>>> Le ven. 27 avr. 2018 18:36, Romain Manni-Bucau <rm...@gmail.com>
>>>>>> a écrit :
>>>>>>
>>>>>>>  Can give it a try end of may, sure. (holidays and work constraints
>>>>>>> will make it hard before).
>>>>>>>
>>>>>>> Le 27 avr. 2018 18:26, "Anton Kedin" <ke...@google.com> a écrit :
>>>>>>>
>>>>>>>> Romain,
>>>>>>>>
>>>>>>>> I don't believe that JSON approach was investigated very
>>>>>>>> thoroughIy. I mentioned few reasons which will make it not the best choice
>>>>>>>> my opinion, but I may be wrong. Can you put together a design doc or a
>>>>>>>> prototype?
>>>>>>>>
>>>>>>>> Thank you,
>>>>>>>> Anton
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, Apr 26, 2018 at 10:17 PM Romain Manni-Bucau <
>>>>>>>> rmannibucau@gmail.com> wrote:
>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Le 26 avr. 2018 23:13, "Anton Kedin" <ke...@google.com> a écrit :
>>>>>>>>>
>>>>>>>>> BeamRecord (Row) has very little in common with JsonObject (I
>>>>>>>>> assume you're talking about javax.json), except maybe some similarities of
>>>>>>>>> the API. Few reasons why JsonObject doesn't work:
>>>>>>>>>
>>>>>>>>>    - it is a Java EE API:
>>>>>>>>>       - Beam SDK is not limited to Java. There are probably
>>>>>>>>>       similar APIs for other languages but they might not necessarily carry the
>>>>>>>>>       same semantics / APIs;
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Not a big deal I think. At least not a technical blocker.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>    - It can change between Java versions;
>>>>>>>>>
>>>>>>>>> No, this is javaee ;).
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>    - Current Beam java implementation is an experimental feature
>>>>>>>>>       to identify what's needed from such API, in the end we might end up with
>>>>>>>>>       something similar to JsonObject API, but likely not
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I dont get that point as a blocker
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>    - ;
>>>>>>>>>       - represents JSON, which is not an API but an object
>>>>>>>>>    notation:
>>>>>>>>>       - it is defined as unicode string in a certain format. If
>>>>>>>>>       you choose to adhere to ECMA-404, then it doesn't sound like JsonObject can
>>>>>>>>>       represent an Avro object, if I'm reading it right;
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> It is in the generator impl, you can impl an avrogenerator.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>    - doesn't define a type system (JSON does, but it's lacking):
>>>>>>>>>       - for example, JSON doesn't define semantics for numbers;
>>>>>>>>>       - doesn't define date/time types;
>>>>>>>>>       - doesn't allow extending JSON type system at all;
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> That is why you need a metada object, or simpler, a schema with
>>>>>>>>> that data. Json or beam record doesnt help here and you end up on the same
>>>>>>>>> outcome if you think about it.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>    - lacks schemas;
>>>>>>>>>
>>>>>>>>> Jsonschema are standard, widely spread and tooled compared to
>>>>>>>>> alternative.
>>>>>>>>>
>>>>>>>>> You can definitely try loosen the requirements and define
>>>>>>>>> everything in JSON in userland, but the point of Row/Schema is to avoid it
>>>>>>>>> and define everything in Beam model, which can be extended, mapped to JSON,
>>>>>>>>> Avro, BigQuery Schemas, custom binary format etc., with same semantics
>>>>>>>>> across beam SDKs.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> This is what jsonp would allow with the benefit of a natural pojo
>>>>>>>>> support through jsonb.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Thu, Apr 26, 2018 at 12:28 PM Romain Manni-Bucau <
>>>>>>>>> rmannibucau@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Just to let it be clear and let me understand: how is BeamRecord
>>>>>>>>>> different from a JsonObject which is an API without implementation (not
>>>>>>>>>> event a json one OOTB)? Advantage of json *api* are indeed natural mapping
>>>>>>>>>> (jsonb is based on jsonp so no new binding to reinvent) and simple
>>>>>>>>>> serialization (json+gzip for ex, or avro if you want to be geeky).
>>>>>>>>>>
>>>>>>>>>> I fail to see the point to rebuild an ecosystem ATM.
>>>>>>>>>>

Re: Beam SQL Improvements

Posted by Romain Manni-Bucau <rm...@gmail.com>.
Hmm, the pluggability part is close to what I wanted to do with JsonObject
as a main API (to avoid redoing a "row" API and a schema API).
Row.as(Class<T>) sounds good, but does it mean we'll end up with
beam-sdk-java-row-jsonobject-like modules (I'm not against it, just trying to
understand here)?
If so, how can an IO use as() with the type it expects? Doesn't it lead to
having a ton of these modules in the end?
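
(A purely hypothetical sketch, just to make the question concrete; neither
Row.as(Class<T>) nor such modules exist at this point:)

  // Hypothetical only: each target type would need its own converter module,
  // e.g. one per ecosystem (JSON-P, JSON-B/POJO, Avro, ...).
  JsonObject json = row.as(JsonObject.class); // beam-sdk-java-row-jsonobject?
  MyPojo pojo = row.as(MyPojo.class);         // a generic JSON-B / POJO module?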

Romain Manni-Bucau
@rmannibucau <https://twitter.com/rmannibucau> |  Blog
<https://rmannibucau.metawerx.net/> | Old Blog
<http://rmannibucau.wordpress.com> | Github <https://github.com/rmannibucau> |
LinkedIn <https://www.linkedin.com/in/rmannibucau> | Book
<https://www.packtpub.com/application-development/java-ee-8-high-performance>



Re: Beam SQL Improvements

Posted by Reuven Lax <re...@google.com>.
By the way Romain, if you have specific scenarios in mind I would love to
hear them. I can try and guess what exactly you would like to get out of
schemas, but it would work better if you gave me concrete scenarios that
you would like to see work.

Reuven


Re: Beam SQL Improvements

Posted by Reuven Lax <re...@google.com>.
Yeah, what I'm working on will help with IO. Basically, if you register a
function with SchemaRegistry that converts back and forth between a type
(say JsonObject) and a Beam Row, then it is applied by the framework behind
the scenes as part of DoFn invocation. Concrete example: let's say I have
an IO that reads JSON objects:
  class MyJsonIORead extends PTransform<PBegin, PCollection<JsonObject>> {...}

If you register a schema for this type (or you can also just set the schema
directly on the output PCollection), then Beam knows how to convert back
and forth between JsonObject and Row. So the next ParDo can look like this:

p.apply(new MyJsonIORead())
 .apply(ParDo.of(new DoFn<JsonObject, T>() {
       @ProcessElement
       public void process(@Element Row row) { ... }
     }));

And Beam will automatically convert JsonObject to a Row for processing (you
aren't forced to do this of course - you can always ask for it as a
JsonObject).
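
(As a rough illustration of that registration step, a minimal and hypothetical
sketch; the SchemaRegistry method below is an assumed name, since this API is
still being designed:)

  // Hypothetical sketch: register toRow/fromRow functions for JsonObject so
  // Beam can apply the conversion behind the scenes. Names are illustrative.
  Schema jsonSchema = Schema.builder()
      .addStringField("type")
      .addInt32Field("size")
      .build();

  pipeline.getSchemaRegistry().registerSchemaForClass(
      JsonObject.class,
      jsonSchema,
      json -> Row.withSchema(jsonSchema)
                 .addValues(json.getString("type"), json.getInt("size"))
                 .build(),
      row -> Json.createObjectBuilder()
                 .add("type", row.getString("type"))
                 .add("size", row.getInt32("size"))
                 .build());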

The same is true for output. If you have a sink that takes in JsonObject
but the transform before it produces Row objects (for instance - because
the transform before it is Beam SQL), Beam can automatically convert Row
back to JsonObject for you.

All of this was detailed in the Schema doc I shared a few months ago. There
was a lot of discussion on that document from various parties, and some of
this API is a result of that discussion. This is also working in the branch
JB and I were working on, though not yet integrated back to master.

I would like to actually go further and make Row an interface and provide a
way to automatically put a Row interface on top of any other object (e.g.
JsonObject, POJO, etc.). This won't change the way the user writes code, but
instead of Beam having to copy and convert at each stage (e.g. from
JsonObject to Row), it will simply create a Row object that uses the
JsonObject as its underlying storage.
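
(Purely illustrative; no such API exists. The point is only that field reads
would delegate to the JsonObject instead of copying it:)

  // Hypothetical zero-copy view, assuming Row were an interface:
  class JsonObjectRowView implements Row {
    private final JsonObject json;
    JsonObjectRowView(JsonObject json) { this.json = json; }
    public String getString(String field) { return json.getString(field); }
    // other field accessors delegate in the same way
  }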

Reuven


Re: Beam SQL Improvements

Posted by Romain Manni-Bucau <rm...@gmail.com>.
Well, Beam can implement a new mapper but it doesn't help for IO. Most
modern backends will take JSON directly, even the javax one, and it must stay
generic.

Then, since JSON-to-POJO mapping has already been done a dozen times, I'm not
sure it is worth it for now.
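
(For context, the existing JSON-to-POJO mapping referred to here is
essentially a one-liner with JSON-B; a minimal illustration, assuming some
MyPojo bean with matching fields:)

  Jsonb jsonb = JsonbBuilder.create();   // javax.json.bind
  MyPojo pojo = jsonb.fromJson("{\"type\":\"foo\",\"size\":333}", MyPojo.class);
  String json = jsonb.toJson(pojo);      // and back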

Le mar. 22 mai 2018 20:27, Reuven Lax <re...@google.com> a écrit :

> We can do even better btw. Building a SchemaRegistry where automatic
> conversions can be registered between schema and Java data types. With this
> the user won't even need a DoFn to do the conversion.
>
> On Tue, May 22, 2018, 10:13 AM Romain Manni-Bucau <rm...@gmail.com>
> wrote:
>
>> Hi guys,
>>
>> Checked out what has been done on schema model and think it is acceptable
>> - regarding the json debate -  if
>> https://issues.apache.org/jira/browse/BEAM-4381 can be fixed.
>>
>> High level, it is about providing a mainstream and not too impacting
>> model OOTB and JSON seems the most valid option for now, at least for IO
>> and some user transforms.
>>
>> Wdyt?
>>
>> Le ven. 27 avr. 2018 18:36, Romain Manni-Bucau <rm...@gmail.com> a
>> écrit :
>>
>>>  Can give it a try end of may, sure. (holidays and work constraints will
>>> make it hard before).
>>>
>>> Le 27 avr. 2018 18:26, "Anton Kedin" <ke...@google.com> a écrit :
>>>
>>>> Romain,
>>>>
>>>> I don't believe that JSON approach was investigated very thoroughIy. I
>>>> mentioned few reasons which will make it not the best choice my opinion,
>>>> but I may be wrong. Can you put together a design doc or a prototype?
>>>>
>>>> Thank you,
>>>> Anton
>>>>
>>>>
>>>> On Thu, Apr 26, 2018 at 10:17 PM Romain Manni-Bucau <
>>>> rmannibucau@gmail.com> wrote:
>>>>
>>>>>
>>>>>
>>>>> Le 26 avr. 2018 23:13, "Anton Kedin" <ke...@google.com> a écrit :
>>>>>
>>>>> BeamRecord (Row) has very little in common with JsonObject (I assume
>>>>> you're talking about javax.json), except maybe some similarities of the
>>>>> API. Few reasons why JsonObject doesn't work:
>>>>>
>>>>>    - it is a Java EE API:
>>>>>       - Beam SDK is not limited to Java. There are probably similar
>>>>>       APIs for other languages but they might not necessarily carry the same
>>>>>       semantics / APIs;
>>>>>
>>>>>
>>>>> Not a big deal I think. At least not a technical blocker.
>>>>>
>>>>>
>>>>>    - It can change between Java versions;
>>>>>
>>>>> No, this is javaee ;).
>>>>>
>>>>>
>>>>>
>>>>>    - Current Beam java implementation is an experimental feature to
>>>>>       identify what's needed from such API, in the end we might end up with
>>>>>       something similar to JsonObject API, but likely not
>>>>>
>>>>>
>>>>> I dont get that point as a blocker
>>>>>
>>>>>
>>>>>    - ;
>>>>>       - represents JSON, which is not an API but an object notation:
>>>>>       - it is defined as unicode string in a certain format. If you
>>>>>       choose to adhere to ECMA-404, then it doesn't sound like JsonObject can
>>>>>       represent an Avro object, if I'm reading it right;
>>>>>
>>>>>
>>>>> It is in the generator impl, you can impl an avrogenerator.
>>>>>
>>>>>
>>>>>    - doesn't define a type system (JSON does, but it's lacking):
>>>>>       - for example, JSON doesn't define semantics for numbers;
>>>>>       - doesn't define date/time types;
>>>>>       - doesn't allow extending JSON type system at all;
>>>>>
>>>>>
>>>>> That is why you need a metada object, or simpler, a schema with that
>>>>> data. Json or beam record doesnt help here and you end up on the same
>>>>> outcome if you think about it.
>>>>>
>>>>>
>>>>>    - lacks schemas;
>>>>>
>>>>> Jsonschema are standard, widely spread and tooled compared to
>>>>> alternative.
>>>>>
>>>>> You can definitely try loosen the requirements and define everything
>>>>> in JSON in userland, but the point of Row/Schema is to avoid it and define
>>>>> everything in Beam model, which can be extended, mapped to JSON, Avro,
>>>>> BigQuery Schemas, custom binary format etc., with same semantics across
>>>>> beam SDKs.
>>>>>
>>>>>
>>>>> This is what jsonp would allow with the benefit of a natural pojo
>>>>> support through jsonb.
>>>>>
>>>>>
>>>>>
>>>>> On Thu, Apr 26, 2018 at 12:28 PM Romain Manni-Bucau <
>>>>> rmannibucau@gmail.com> wrote:
>>>>>
>>>>>> Just to let it be clear and let me understand: how is BeamRecord
>>>>>> different from a JsonObject which is an API without implementation (not
>>>>>> event a json one OOTB)? Advantage of json *api* are indeed natural mapping
>>>>>> (jsonb is based on jsonp so no new binding to reinvent) and simple
>>>>>> serialization (json+gzip for ex, or avro if you want to be geeky).
>>>>>>
>>>>>> I fail to see the point to rebuild an ecosystem ATM.
>>>>>>
>>>>>> Le 26 avr. 2018 19:12, "Reuven Lax" <re...@google.com> a écrit :
>>>>>>
>>>>>>> Exactly what JB said. We will write a generic conversion from Avro
>>>>>>> (or json) to Beam schemas, which will make them work transparently with
>>>>>>> SQL. The plan is also to migrate Anton's work so that POJOs works
>>>>>>> generically for any schema.
>>>>>>>
>>>>>>> Reuven
>>>>>>>
>>>>>>> On Thu, Apr 26, 2018 at 1:17 AM Jean-Baptiste Onofré <
>>>>>>> jb@nanthrax.net> wrote:
>>>>>>>
>>>>>>>> For now we have a generic schema interface. Json-b can be an impl,
>>>>>>>> avro could be another one.
>>>>>>>>
>>>>>>>> Regards
>>>>>>>> JB
>>>>>>>> Le 26 avr. 2018, à 12:08, Romain Manni-Bucau <rm...@gmail.com>
>>>>>>>> a écrit:
>>>>>>>>>
>>>>>>>>> Hmm,
>>>>>>>>>
>>>>>>>>> avro has still the pitfalls to have an uncontrolled stack which
>>>>>>>>> brings way too much dependencies to be part of any API,
>>>>>>>>> this is why I proposed a JSON-P based API (JsonObject) with a
>>>>>>>>> custom beam entry for some metadata (headers "à la Camel").
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Romain Manni-Bucau
>>>>>>>>> @rmannibucau <https://twitter.com/rmannibucau> |   Blog
>>>>>>>>> <https://rmannibucau.metawerx.net/> | Old Blog
>>>>>>>>> <http://rmannibucau.wordpress.com> |  Github
>>>>>>>>> <https://github.com/rmannibucau> | LinkedIn
>>>>>>>>> <https://www.linkedin.com/in/rmannibucau> | Book
>>>>>>>>> <https://www.packtpub.com/application-development/java-ee-8-high-performance>
>>>>>>>>>
>>>>>>>>> 2018-04-26 9:59 GMT+02:00 Jean-Baptiste Onofré <jb...@nanthrax.net>:
>>>>>>>>>
>>>>>>>>>> Hi Ismael
>>>>>>>>>>
>>>>>>>>>> You mean directly in Beam SQL ?
>>>>>>>>>>
>>>>>>>>>> That will be part of schema support: generic record could be one
>>>>>>>>>> of the payload with across schema.
>>>>>>>>>>
>>>>>>>>>> Regards
>>>>>>>>>> JB
>>>>>>>>>> Le 26 avr. 2018, à 11:39, "Ismaël Mejía" < iemejia@gmail.com> a
>>>>>>>>>> écrit:
>>>>>>>>>>>
>>>>>>>>>>> Hello Anton,
>>>>>>>>>>>
>>>>>>>>>>> Thanks for the descriptive email and the really useful work. Any plans
>>>>>>>>>>> to tackle PCollections of GenericRecord/IndexedRecords? it seems Avro
>>>>>>>>>>> is a natural fit for this approach too.
>>>>>>>>>>>
>>>>>>>>>>> Regards,
>>>>>>>>>>> Ismaël
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Apr 25, 2018 at 9:04 PM, Anton Kedin <ke...@google.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>>            Hi,
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>  I want to highlight a couple of improvements to Beam SQL we have been
>>>>>>>>>>>>
>>>>>>>>>>>>  working on recently which are targeted to make Beam SQL API easier to use.
>>>>>>>>>>>>
>>>>>>>>>>>>  Specifically these features simplify conversion of Java Beans and JSON
>>>>>>>>>>>>
>>>>>>>>>>>>  strings to Rows.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>  Feel free to try this and send any bugs/comments/PRs my way.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>  **Caveat: this is still work in progress, and has known bugs and incomplete
>>>>>>>>>>>>
>>>>>>>>>>>>  features, see below for details.**
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>  Background
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>  Beam SQL queries can only be applied to PCollection<Row>. This means that
>>>>>>>>>>>>
>>>>>>>>>>>>  users need to convert whatever PCollection elements they have to Rows before
>>>>>>>>>>>>
>>>>>>>>>>>>  querying them with SQL. This usually requires manually creating a Schema and
>>>>>>>>>>>>
>>>>>>>>>>>>  implementing a custom conversion
>>>>>>>>>>>>
>>>>>>>>>>>>  PTransform<PCollection<Element>, PCollection<Row>> (see Beam SQL Guide).
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>  The improvements described here are an attempt to reduce this overhead for
>>>>>>>>>>>>
>>>>>>>>>>>>  few common cases, as a start.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>  Status
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>  Introduced a InferredRowCoder to automatically generate rows from beans.
>>>>>>>>>>>>
>>>>>>>>>>>>  Removes the need to manually define a Schema and Row conversion logic;
>>>>>>>>>>>>
>>>>>>>>>>>>  Introduced JsonToRow transform to automatically parse JSON objects to Rows.
>>>>>>>>>>>>
>>>>>>>>>>>>  Removes the need to manually implement a conversion logic;
>>>>>>>>>>>>
>>>>>>>>>>>>  This is still experimental work in progress, APIs will likely change;
>>>>>>>>>>>>
>>>>>>>>>>>>  There are known bugs/unsolved problems;
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>  Java Beans
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>  Introduced a coder which facilitates Rows generation from Java Beans.
>>>>>>>>>>>>
>>>>>>>>>>>>  Reduces the overhead to:
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>>             /** Some user-defined Java Bean */
>>>>>>>>>>>>>
>>>>>>>>>>>>>  class JavaBeanObject implements Serializable {
>>>>>>>>>>>>>
>>>>>>>>>>>>>  String getName() { ... }
>>>>>>>>>>>>>
>>>>>>>>>>>>>  }
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>  // Obtain the objects:
>>>>>>>>>>>>>
>>>>>>>>>>>>>  PCollection<JavaBeanObject> javaBeans = ...;
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>  // Convert to Rows and apply a SQL query:
>>>>>>>>>>>>>
>>>>>>>>>>>>>  PCollection<Row> queryResult =
>>>>>>>>>>>>>
>>>>>>>>>>>>>  javaBeans
>>>>>>>>>>>>>
>>>>>>>>>>>>>  .setCoder(InferredRowCoder.ofSerializable(JavaBeanObject.class))
>>>>>>>>>>>>>
>>>>>>>>>>>>>  .apply(BeamSql.query("SELECT name FROM PCOLLECTION"));
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>  Notice, there is no more manual Schema definition or custom conversion
>>>>>>>>>>>>
>>>>>>>>>>>>  logic.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>  Links
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>   example;
>>>>>>>>>>>>
>>>>>>>>>>>>   InferredRowCoder;
>>>>>>>>>>>>
>>>>>>>>>>>>   test;
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>  JSON
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>  Introduced JsonToRow transform. It is possible to query a
>>>>>>>>>>>>
>>>>>>>>>>>>  PCollection<String> that contains JSON objects like this:
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>>             // Assuming JSON objects look like this:
>>>>>>>>>>>>>
>>>>>>>>>>>>>  // { "type" : "foo", "size" : 333 }
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>  // Define a Schema:
>>>>>>>>>>>>>
>>>>>>>>>>>>>  Schema jsonSchema =
>>>>>>>>>>>>>
>>>>>>>>>>>>>  Schema
>>>>>>>>>>>>>
>>>>>>>>>>>>>  .builder()
>>>>>>>>>>>>>
>>>>>>>>>>>>>  .addStringField("type")
>>>>>>>>>>>>>
>>>>>>>>>>>>>  .addInt32Field("size")
>>>>>>>>>>>>>
>>>>>>>>>>>>>  .build();
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>  // Obtain PCollection of the objects in JSON format:
>>>>>>>>>>>>>
>>>>>>>>>>>>>  PCollection<String> jsonObjects = ...
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>  // Convert to Rows and apply a SQL query:
>>>>>>>>>>>>>
>>>>>>>>>>>>>  PCollection<Row> queryResults =
>>>>>>>>>>>>>
>>>>>>>>>>>>>  jsonObjects
>>>>>>>>>>>>>
>>>>>>>>>>>>>  .apply(JsonToRow.withSchema(jsonSchema))
>>>>>>>>>>>>>
>>>>>>>>>>>>>  .apply(BeamSql.query("SELECT type, AVG(size) FROM PCOLLECTION GROUP BY type"));
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>  Notice, JSON to Row conversion is done by JsonToRow transform. It is
>>>>>>>>>>>>
>>>>>>>>>>>>  currently required to supply a Schema.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>  Links
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>   JsonToRow;
>>>>>>>>>>>>
>>>>>>>>>>>>   test/example;
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>  Going Forward
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>  fix bugs (BEAM-4163, BEAM-4161 ...)
>>>>>>>>>>>>
>>>>>>>>>>>>  implement more features (BEAM-4167, more types of objects);
>>>>>>>>>>>>
>>>>>>>>>>>>  wire this up with sources/sinks to further simplify SQL API;
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>  Thank you,
>>>>>>>>>>>>
>>>>>>>>>>>>  Anton
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>
>>>>>
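
To make the overhead described in the Background section of the quoted announcement
concrete, the manual conversion it removes looks roughly like this (a sketch, not
code from the thread; it assumes Beam's ParDo/DoFn and Row.withSchema APIs and
reuses the JavaBeanObject shape from the quoted example, with javaBeans being an
existing PCollection<JavaBeanObject>):

    // A hand-rolled Schema plus a per-type conversion DoFn -- the boilerplate that
    // InferredRowCoder (and, for JSON strings, JsonToRow) is meant to remove.
    Schema beanSchema = Schema.builder().addStringField("name").build();

    PCollection<Row> rows =
        javaBeans.apply(
            ParDo.of(
                new DoFn<JavaBeanObject, Row>() {
                  @ProcessElement
                  public void processElement(ProcessContext c) {
                    c.output(
                        Row.withSchema(beanSchema)
                            .addValues(c.element().getName())
                            .build());
                  }
                }));
    // Depending on the SDK version, a Row coder/schema may also need to be set
    // on the output before it can be consumed by BeamSql.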

Re: Beam SQL Improvements

Posted by Reuven Lax <re...@google.com>.
We can do even better, btw, by building a SchemaRegistry where automatic
conversions can be registered between schemas and Java data types. With this,
the user won't even need a DoFn to do the conversion.
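
A rough sketch of the shape such a registry might take is below (purely hypothetical:
none of these names are an existing Beam API at this point in the thread; Schema and
Row are the Beam classes discussed above, and their import locations are assumed):

    import java.util.HashMap;
    import java.util.Map;
    import java.util.function.Function;

    import org.apache.beam.sdk.schemas.Schema; // assumed import location
    import org.apache.beam.sdk.values.Row;     // assumed import location

    /** Hypothetical sketch only -- not an existing Beam API. */
    public final class RowConverterRegistry {

        private final Map<Class<?>, Schema> schemas = new HashMap<>();
        private final Map<Class<?>, Function<?, Row>> converters = new HashMap<>();

        /** Register a Schema and a to-Row converter for a Java type, once, at setup time. */
        public <T> void register(Class<T> type, Schema schema, Function<T, Row> toRow) {
            schemas.put(type, schema);
            converters.put(type, toRow);
        }

        /** Look up the registered converter, so pipelines need no hand-written DoFn. */
        @SuppressWarnings("unchecked")
        public <T> Row toRow(T value) {
            Function<T, Row> toRow = (Function<T, Row>) converters.get(value.getClass());
            if (toRow == null) {
                throw new IllegalArgumentException("No conversion registered for " + value.getClass());
            }
            return toRow.apply(value);
        }

        public Schema schemaFor(Class<?> type) {
            return schemas.get(type);
        }
    }

Registration would then happen once, e.g. registry.register(JavaBeanObject.class,
schema, bean -> Row.withSchema(schema).addValues(bean.getName()).build()), after
which transforms could look conversions up instead of requiring per-pipeline glue.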

On Tue, May 22, 2018, 10:13 AM Romain Manni-Bucau <rm...@gmail.com>
wrote:

> Hi guys,
>
> Checked out what has been done on schema model and think it is acceptable
> - regarding the json debate -  if
> https://issues.apache.org/jira/browse/BEAM-4381 can be fixed.
>
> High level, it is about providing a mainstream and not too impacting model
> OOTB and JSON seems the most valid option for now, at least for IO and some
> user transforms.
>
> Wdyt?
>

Re: Beam SQL Improvements

Posted by Kenneth Knowles <kl...@google.com>.
Yea, I'm sure if you took on BEAM-4381 some folks would find it useful.

Kenn

On Tue, May 22, 2018 at 10:13 AM Romain Manni-Bucau <rm...@gmail.com>
wrote:

> Hi guys,
>
> Checked out what has been done on schema model and think it is acceptable
> - regarding the json debate -  if
> https://issues.apache.org/jira/browse/BEAM-4381 can be fixed.
>
> High level, it is about providing a mainstream and not too impacting model
> OOTB and JSON seems the most valid option for now, at least for IO and some
> user transforms.
>
> Wdyt?
>