Posted to dev@beam.apache.org by Jean-Baptiste Onofré <jb...@nanthrax.net> on 2018/03/05 04:24:07 UTC

Re: Schema-Aware PCollections revisited

Hi Reuven,

I'm reviving this discussion, as I think it would be a great addition.

We have discussed this informally, but I think it would now be great to have a
feature branch as a basis for discussion, where we can start sketching an
implementation.

@Reuven, did you start a PoC with what you proposed:
- SchemaCoder
- SchemaRegistry
- @FieldAccess on DoFn
- Select.fields PTransform
?

If not, I volunteer to create the branch and start sketching.
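
To make the question concrete, here is a minimal sketch of how a schema, a
generic row and a Select.fields-style projection could fit together. It is
plain Java with no Beam dependency, and every name in it (SchemaSketch, Schema,
Row, selectFields) is hypothetical: none of it is taken from the proposal or
from an existing Beam API, and SchemaCoder, SchemaRegistry and @FieldAccess are
left out entirely.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only: none of these types are Beam APIs. */
final class SchemaSketch {
  private SchemaSketch() {}

  /** Hypothetical schema: an ordered mapping from field name to Java type. */
  static final class Schema {
    private final Map<String, Class<?>> fields = new LinkedHashMap<>();

    Schema field(String name, Class<?> type) {
      fields.put(name, type);
      return this;
    }

    List<String> fieldNames() {
      return new ArrayList<>(fields.keySet());
    }

    Class<?> typeOf(String name) {
      Class<?> type = fields.get(name);
      if (type == null) {
        throw new IllegalArgumentException("Unknown field: " + name);
      }
      return type;
    }
  }

  /** Hypothetical generic record whose values are checked against a schema. */
  static final class Row {
    final Schema schema;
    private final Map<String, Object> values = new HashMap<>();

    Row(Schema schema) {
      this.schema = schema;
    }

    Row set(String name, Object value) {
      Class<?> type = schema.typeOf(name);
      // Null is allowed; any other value must match the declared type.
      if (value != null && !type.isInstance(value)) {
        throw new IllegalArgumentException(name + " expects " + type.getSimpleName());
      }
      values.put(name, value);
      return this;
    }

    Object get(String name) {
      schema.typeOf(name); // fail fast on fields the schema does not declare
      return values.get(name);
    }
  }

  /** Stand-in for a Select.fields transform: project a row onto named fields. */
  static Row selectFields(Row input, String... names) {
    Schema projected = new Schema();
    for (String name : names) {
      projected.field(name, input.schema.typeOf(name));
    }
    Row output = new Row(projected);
    for (String name : names) {
      output.set(name, input.get(name));
    }
    return output;
  }

  public static void main(String[] args) {
    Schema schema = new Schema().field("user", String.class).field("clicks", Long.class);
    Row row = new Row(schema).set("user", "alice").set("clicks", 42L);
    Row projected = selectFields(row, "clicks");
    System.out.println(projected.schema.fieldNames() + " -> " + projected.get("clicks"));
  }
}
```

The point of the sketch is only that the projection can derive its output
schema from the input schema, which is the kind of thing a feature branch would
let us discuss against real code.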

Thoughts?

Regards
JB

On 02/04/2018 08:23 PM, Reuven Lax wrote:
> Cool, let's chat about this on slack for a bit (which I realized I've been
> signed out of for some time).
> 
> Reuven
> 
> On Sun, Feb 4, 2018 at 9:21 AM, Jean-Baptiste Onofré <jb@nanthrax.net> wrote:
> 
>     Sorry guys, I was off today. Happy to be part of the party too ;)
> 
>     Regards
>     JB
> 
>     On 02/04/2018 06:19 PM, Reuven Lax wrote:
>     > Romain, since you're interested maybe the two of us should put together a
>     > proposal for how to set these things (hints, schema) on PCollections? I don't
>     > think it'll be hard - the previous list thread on hints already agreed on a
>     > general approach, and we would just need to flesh it out.
>     >
>     > BTW in the past when I looked, Json schemas seemed to have some odd limitations
>     > inherited from Javascript (e.g. no distinction between integer and
>     > floating-point types). Is that still true?
>     >
>     > Reuven
>     >
>     > On Sun, Feb 4, 2018 at 9:12 AM, Romain Manni-Bucau <rmannibucau@gmail.com> wrote:
>     >
>     >
>     >
>     >     2018-02-04 17:53 GMT+01:00 Reuven Lax <relax@google.com>:
>     >
>     >
>     >
>     >         On Sun, Feb 4, 2018 at 8:42 AM, Romain Manni-Bucau
>     >         <rmannibucau@gmail.com> wrote:
>     >
>     >
>     >             2018-02-04 17:37 GMT+01:00 Reuven Lax <relax@google.com>:
>     >
>     >                 I'm not sure where proto comes from here. Proto is one example
>     >                 of a type that has a schema, but only one example.
>     >
>     >                 1. In the initial prototype I want to avoid modifying the
>     >                 PCollection API. So I think it's best to create a special
>     >                 SchemaCoder, and pass the schema into this coder. Later we
>     >                 might add targeted APIs for this instead of going through a coder.
>     >                 1.a I don't see what hints have to do with this?
>     >
>     >
>     >             Hints are a way to replace the new API and unify the way to pass
>     >             metadata in beam instead of adding a new custom way each time.
>     >
>     >
>     >         I don't think schema is a hint. But I hear what you're saying - a hint is a
>     >         type of PCollection metadata, as is schema, and we should have a unified
>     >         API for setting such metadata.
>     >
>     >
>     >     :), Ismael pointed out to me earlier this week that "hint" had an old meaning
>     >     in beam. My usage is purely the one found in most EE specs (your "metadata" in
>     >     the previous answer). But I guess we are aligned on the meaning now, just wanted
>     >     to be sure.
>     >      
>     >
>     >          
>     >
>     >              
>     >
>     >
>     >                 2. BeamSQL already has a generic record type which fits this use
>     >                 case very well (though we might modify it). However as mentioned
>     >                 in the doc, the user is never forced to use this generic record
>     >                 type.
>     >
>     >
>     >             Well, yes and no. A type already exists, but 1. it is very strictly
>     >             limited (flat columns only, which is a small part of what big data SQL
>     >             can do) and 2. it must be aligned with the convergence on generic data
>     >             that the schema will bring (really read "aligned" as "dropped in favor
>     >             of" - deprecation being a smooth way to do it).
>     >
>     >
>     >         As I said, the existing class needs to be modified and extended, and not
>     >         just for this schema use case. It was meant to represent Calcite SQL rows,
>     >         but doesn't quite even do that yet (Calcite supports nested rows).
>     >         However I think it's the right basis to start from.
>     >
>     >
>     >     Agree on the state. The issues I hit with the current impl (in addition to
>     >     nested support, which would by itself require a kind of visitor solution) are
>     >     that the record owns the schema and that serialization is handled field by
>     >     field instead of as a whole, which is how it would be handled with a schema
>     >     IMHO.
>     >
>     >     Concretely, what I don't want is to do a PoC which works (they all work,
>     >     right?) and integrate it into beam without thinking about a global solution
>     >     for this generic record issue and its schema standardization. This is where
>     >     Json(-P) has a lot of value IMHO, but it requires a bit more love than just
>     >     adding a schema to the model.
>     >      
>     >
>     >          
>     >
>     >
>     >             So long story short, the main work of this schema track is not only
>     >             about using schemas in runners and elsewhere but also about starting
>     >             to make beam consistent with itself, which is probably the most
>     >             important outcome since it is the user-facing side of this work.
>     >              
>     >
>     >
>     >                 On Sun, Feb 4, 2018 at 12:22 AM, Romain Manni-Bucau
>     >                 <rmannibucau@gmail.com> wrote:
>     >
>     >                     @Reuven: is the proto only about passing schema or also the
>     >                     generic type?
>     >
>     >                     There are 2.5 topics to solve this issue:
>     >
>     >                     1. How to pass schema
>     >                     1.a. hints?
>     >                     2. What is the generic record type associated with a schema
>     >                     and how to express a schema relative to it
>     >
>     >                     I would be happy to help on 1.a and 2 somehow if you need.
>     >
>     >                     On Feb 4, 2018 at 03:30, "Reuven Lax" <relax@google.com> wrote:
>     >
>     >                         One more thing. If anyone here has experience with
>     >                         various OSS metadata stores (e.g. Kafka Schema Registry
>     >                         is one example), would you like to collaborate on
>     >                         implementation? I want to make sure that source schemas
>     >                         can be stored in a variety of OSS metadata stores, and
>     >                         be easily pulled into a Beam pipeline.
>     >
>     >                         Reuven
>     >
>     >                         On Sat, Feb 3, 2018 at 6:28 PM, Reuven Lax
>     >                         <relax@google.com> wrote:
>     >
>     >                             Hi all,
>     >
>     >                             If there are no concerns, I would like to start
>     >                             working on a prototype. It's just a prototype, so I
>     >                             don't think it will have the final API (e.g. for the
>     >                             prototype I'm going to avoid changing the API of
>     >                             PCollection, and use a "special" Coder instead).
>     >                             Also even once we go beyond prototype, it will be
>     >                             @Experimental for some time, so the API will not be
>     >                             fixed in stone.
>     >
>     >                             Any more comments on this approach before we start
>     >                             implementing a prototype?
>     >
>     >                             Reuven
>     >
>     >                             On Wed, Jan 31, 2018 at 1:12 PM, Romain Manni-Bucau
>     >                             <rmannibucau@gmail.com> wrote:
>     >
>     >                                 If you need help on the json part I'm happy to
>     >                                 help. To give a few hints on what is very
>     >                                 doable: we can add an avro module to johnzon
>     >                                 (asf json{p,b} impl) to back jsonp by avro
>     >                                 (guess it will be one of the first to be asked)
>     >                                 for instance.
>     >
>     >
>     >                                 Romain Manni-Bucau
>     >                                 @rmannibucau <https://twitter.com/rmannibucau> |
>     >                                 Blog <https://rmannibucau.metawerx.net/> | Old
>     >                                 Blog <http://rmannibucau.wordpress.com> | Github
>     >                                 <https://github.com/rmannibucau> | LinkedIn
>     >                                 <https://www.linkedin.com/in/rmannibucau>
>     >
>     >                                 2018-01-31 22:06 GMT+01:00 Reuven Lax
>     >                                 <relax@google.com>:
>     >
>     >                                     Agree. The initial implementation will be a
>     >                                     prototype.
>     >
>     >                                     On Wed, Jan 31, 2018 at 12:21 PM,
>     >                                     Jean-Baptiste Onofré <jb@nanthrax.net> wrote:
>     >
>     >                                         Hi Reuven,
>     >
>     >                                         Agree that we should be able to describe
>     >                                         the schema in different formats. The good
>     >                                         point about json schemas is that they are
>     >                                         described by a spec. My point is also to
>     >                                         avoid reinventing the wheel. Just an
>     >                                         abstraction to be able to use Avro, Json,
>     >                                         Calcite, or custom schema descriptors
>     >                                         would be great.
>     >
>     >                                         Using a coder to describe a schema sounds
>     >                                         like a smart move to implement quickly.
>     >                                         However, it has to be clear in terms of
>     >                                         documentation to avoid "side effects". I
>     >                                         still think PCollection.setSchema() is
>     >                                         better: it should be metadata (or hint
>     >                                         ;))) on the PCollection.
>     >
>     >                                         Regards
>     >                                         JB
>     >
>     >                                         On 31/01/2018 20:16, Reuven Lax wrote:
>     >
>     >                                             As to the question of how a schema
>     >                                             should be specified, I want to
>     >                                             support several common schema
>     >                                             formats. So if a user has a Json
>     >                                             schema, or an Avro schema, or a
>     >                                             Calcite schema, etc. there should be
>     >                                             adapters that allow setting a schema
>     >                                             from any of them. I don't think we
>     >                                             should prefer one over the other.
>     >                                             While Romain is right that many
>     >                                             people know Json, I think far fewer
>     >                                             people know Json schemas.
>     >
>     >                                             Agree, schemas should not be
>     >                                             enforced (for one thing, that
>     >                                             wouldn't be backwards compatible!).
>     >                                             I think for the initial prototype I
>     >                                             will probably use a special coder to
>     >                                             represent the schema (with setSchema
>     >                                             an option on the coder), largely
>     >                                             because it doesn't require modifying
>     >                                             PCollection. However I think longer
>     >                                             term a schema should be an optional
>     >                                             piece of metadata on the PCollection
>     >                                             object. Similar to the previous
>     >                                             discussion about "hints," I think
>     >                                             this can be set on the producing
>     >                                             PTransform, and a SetSchema
>     >                                             PTransform will allow attaching a
>     >                                             schema to any PCollection (i.e.
>     >                                             pc.apply(SetSchema.of(schema))).
>     >                                             This part isn't designed yet, but I
>     >                                             think schema should be similar to
>     >                                             hints, it's just another piece of
>     >                                             metadata on the PCollection (though
>     >                                             something interpreted by the model,
>     >                                             where hints are interpreted by the
>     >                                             runner).
>     >
>     >                                             Reuven
>     >
>     >                                             On Tue, Jan 30, 2018 at 1:37 AM,
>     >                                             Jean-Baptiste Onofré
>     >                                             <jb@nanthrax.net> wrote:
>     >
>     >                                                 Hi,
>     >
>     >                                                 I think we should avoid mixing two
>     >                                                 things in the discussion (and so
>     >                                                 the document):
>     >
>     >                                                 1. The element of the collection
>     >                                                 and the schema itself are two
>     >                                                 different things.
>     >                                                 By essence, Beam should not
>     >                                                 enforce any schema. That's why I
>     >                                                 think it's a good idea to set the
>     >                                                 schema optionally on the
>     >                                                 PCollection
>     >                                                 (pcollection.setSchema()).
>     >
>     >                                                 2. From point 1 come two
>     >                                                 questions: how do we represent a
>     >                                                 schema? How can we leverage the
>     >                                                 schema to simplify the
>     >                                                 serialization of the elements in
>     >                                                 the PCollection, and querying?
>     >                                                 These two questions are not
>     >                                                 directly related.
>     >
>     >                                                   2.1 How do we represent the schema
>     >                                                 Json Schema is a very interesting
>     >                                                 idea. It could be an abstraction,
>     >                                                 and other providers, like Avro,
>     >                                                 can be bound to it. It's part of
>     >                                                 the json processing spec (javax).
>     >
>     >                                                   2.2. How do we leverage the
>     >                                                 schema for query and serialization
>     >                                                 Also in the spec, json pointer is
>     >                                                 interesting for the querying.
>     >                                                 Regarding the serialization,
>     >                                                 jackson or another data binder can
>     >                                                 be used.
>     >
>     >                                                 These are still rough ideas in my
>     >                                                 mind, but I like Romain's idea
>     >                                                 about json-p usage.
>     >
>     >                                                 Once the 2.3.0 release is out, I
>     >                                                 will start to update the document
>     >                                                 with those ideas, and a PoC.
>     >
>     >                                                 Thanks !
>     >                                                 Regards
>     >                                                 JB
>     >
>     >                                                 On 01/30/2018 08:42 AM, Romain Manni-Bucau wrote:
>     >                                                 >
>     >                                                 >
>     >                                                 > On Jan 30, 2018 at 01:09, "Reuven Lax" <relax@google.com> wrote:
>     >                                                 >
>     >                                                 >
>     >                                                 >
>     >                                                 >     On Mon, Jan 29, 2018 at 12:17 PM, Romain Manni-Bucau
>     >                                                 >     <rmannibucau@gmail.com> wrote:
>     >                                                  >
>     >                                                  >         Hi
>     >                                                  >
>     >                                                  >         I have some questions on this: how
>     >                                                  >         would hierarchical schemas work? It
>     >                                                  >         seems this is not really supported
>     >                                                  >         by the ecosystem (apart from custom
>     >                                                  >         stuff) :(. How would it integrate
>     >                                                  >         smoothly with other generic record
>     >                                                  >         types - N bridges?
>     >                                                  >
>     >                                                  >
>     >                                                  >     Do you mean nested schemas? What do you mean here?
>     >                                                  >
>     >                                                  >
>     >                                                  > Yes, sorry - I wrote the mail too late ;). I meant
>     >                                                  > hierarchical data and nested schemas.
>     >                                                  >
>     >                                                  >
>     >                                                  >         Concretely I wonder if using the json
>     >                                                  >         API couldn't be beneficial: json-p is
>     >                                                  >         a nice generic abstraction with a
>     >                                                  >         built-in querying mechanism
>     >                                                  >         (jsonpointer) but no actual
>     >                                                  >         serialization (even if json and
>     >                                                  >         binary json are very natural). The
>     >                                                  >         big advantage is to have a
>     >                                                  >         well-known ecosystem - who doesn't
>     >                                                  >         know json today? - that beam can
>     >                                                  >         reuse for free: JsonObject (guess we
>     >                                                  >         don't want the JsonValue
>     >                                                  >         abstraction) for the record type,
>     >                                                  >         the jsonschema standard for the
>     >                                                  >         schema, jsonpointer for the
>     >                                                  >         selection/projection, etc. It
>     >                                                  >         doesn't enforce the actual
>     >                                                  >         serialization (json, smile, avro,
>     >                                                  >         ...) but provides an expressive and
>     >                                                  >         already known API, so I see it as a
>     >                                                  >         big win-win for users (no need to
>     >                                                  >         learn a new API and use N bridges
>     >                                                  >         everywhere) and beam (impls are here
>     >                                                  >         and the API design is already
>     >                                                  >         thought through).
>     >                                                  >
>     >                                                  >
>     >                                                  >     I assume you're talking about the API
>     >                                                  >     for setting schemas, not using them.
>     >                                                  >     Json has many downsides and I'm not
>     >                                                  >     sure it's true that everyone knows it;
>     >                                                  >     there are also competing schema APIs,
>     >                                                  >     such as Avro etc. However I think we
>     >                                                  >     should give Json a fair evaluation
>     >                                                  >     before dismissing it.
>     >                                                  >
>     >                                                  >
>     >                                                  > It is a wider topic than schemas. Actually
>     >                                                  > schemas are not the first-class citizen, but
>     >                                                  > a generic data representation is. That is
>     >                                                  > where json beats almost any other API. Then,
>     >                                                  > when it comes to schemas, json has a standard
>     >                                                  > for that, so we are all good.
>     >                                                  >
>     >                                                  > Also json has a good indexing API compared to
>     >                                                  > alternatives, which are sometimes a bit
>     >                                                  > faster - for noop transforms - but are hardly
>     >                                                  > usable or make the code not that readable.
>     >                                                  >
>     >                                                  > Avro is a nice competitor, and it is
>     >                                                  > compatible - actually avro is json-driven by
>     >                                                  > design - but its API is far from being that
>     >                                                  > easy due to its schema enforcement, which is
>     >                                                  > heavy, and worse, you can't work with avro
>     >                                                  > without a schema. Json would allow us to
>     >                                                  > reconcile the dynamic and static cases, since
>     >                                                  > the job wouldn't change except for the
>     >                                                  > setSchema.
>     >                                                  >
>     >                                                  > That is why I think json is a good
>     >                                                  > compromise, and having a standard API for it
>     >                                                  > allows us to fully customize the impl at will
>     >                                                  > if needed - even using avro or protobuf.
>     >                                                  >
>     >                                                  > Side note on the beam api: I don't think it
>     >                                                  > is good to use a main API for runner
>     >                                                  > optimization. It enforces something to be
>     >                                                  > shared by all runners but not widely usable.
>     >                                                  > It is also misleading for users. Would you
>     >                                                  > set a flink pipeline option with dataflow? My
>     >                                                  > proposal here is to use hints - properties -
>     >                                                  > instead of something hard-coded in the API,
>     >                                                  > then standardize it if all runners support
>     >                                                  > it.
>     >                                                  >
>     >                                                  >
>     >                                                  >
>     >                                                  >         Wdyt?
>     >                                                  >
>     >                                                  >         On Jan 29, 2018, at 06:24, "Jean-Baptiste Onofré" <jb@nanthrax.net> wrote:
>     >
>     >                                                  >
>     >                                                  >             Hi Reuven,
>     >                                                  >
>     >                                                  >             Thanks for the update! As I'm working with you on this, I fully agree,
>     >                                                  >             and great doc gathering the ideas.
>     >                                                  >
>     >                                                  >             It's clearly something we have to add asap in Beam, because it would
>     >                                                  >             allow new use cases for our users (in a simple way) and open new areas
>     >                                                  >             for the runners (for instance dataframe support in the Spark runner).
>     >                                                  >
>     >                                                  >             By the way, a while ago I created BEAM-3437 to track the PoC/PR around
>     >                                                  >             this.
>     >                                                  >
>     >                                                  >             Thanks !
>     >                                                  >
>     >                                                  >             Regards
>     >                                                  >             JB
>     >                                                  >
>     >                                                  >             On 01/29/2018 02:08 AM, Reuven Lax wrote:
>     >                                                  >             > Previously I submitted a proposal for adding schemas as a first-class
>     >                                                  >             > concept on Beam PCollections. The proposal engendered quite a bit of
>     >                                                  >             > discussion from the community - more discussion than I've seen from
>     >                                                  >             > almost any of our proposals to date!
>     >                                                  >             >
>     >                                                  >             > Based on the feedback and comments, I reworked the proposal document
>     >                                                  >             > quite a bit. It now talks more explicitly about the difference between
>     >                                                  >             > dynamic schemas (where the schema is not fully known at graph-creation
>     >                                                  >             > time), and static schemas (which are fully known at graph-creation
>     >                                                  >             > time). Proposed APIs are more fleshed out now (again thanks to feedback
>     >                                                  >             > from community members), and the document talks in more detail about
>     >                                                  >             > evolving schemas in long-running streaming pipelines.
>     >                                                  >             >
>     >                                                  >             > Please take a look. I think this will be very valuable to Beam, and
>     >                                                  >             > welcome any feedback.
>     >                                                  >             >
>     >                                                  >             >
>     >                                                  >             > https://docs.google.com/document/d/1tnG2DPHZYbsomvihIpXruUmQ12pHGK0QIvXS1FOTgRc/edit#
>     >                                                  >             >
>     >                                                  >             > Reuven
>     >                                                  >
>     >                                                  >             --
>     >                                                  >             Jean-Baptiste Onofré
>     >                                                  >             jbonofre@apache.org
>     >                                                  >             http://blog.nanthrax.net
>     >                                                  >             Talend - http://www.talend.com
>     >                                                  >
>     >                                                  >
>     >                                                  >
>     >
>     >                                                 --
>     >                                                 Jean-Baptiste Onofré
>     >                                                 jbonofre@apache.org
>     >                                                 http://blog.nanthrax.net
>     >                                                 Talend - http://www.talend.com
>     >
> 
>     --
>     Jean-Baptiste Onofré
>     jbonofre@apache.org
>     http://blog.nanthrax.net
>     Talend - http://www.talend.com
> 
> 

-- 
Jean-Baptiste Onofré
jbonofre@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com

Re: Schema-Aware PCollections revisited

Posted by Reuven Lax <re...@google.com>.
Of course! I think some BeamSQL folks should be involved as well, as this
directly affects SQL work. Anton especially has expressed interest in Row
and schemas.

Reuven


On Mon, Mar 5, 2018 at 4:30 AM Jean-Baptiste Onofré <jb...@nanthrax.net> wrote:

> Cool,
>
> can I work with you on this (sharing a branch for instance) ?
>
> Thanks !
> Regards
> JB
>
> On 03/05/2018 01:01 PM, Reuven Lax wrote:
> > Yes, I do have a PoC in progress. The Beam Row class was being refactored,
> > so I paused to wait for that to finish.
> >
> >
> > On Sun, Mar 4, 2018 at 8:24 PM Jean-Baptiste Onofré <jb@nanthrax.net> wrote:
> >
> >     Hi Reuven,
> >
> >     I revive this discussion as I think it would be a great addition.
> >
> >     We had discussion on the fly, but I think now, as base for discussion,
> >     it would be great to have a feature branch where we can start some
> >     sketch/impl and discuss.
> >
> >     @Reuven, did you start a PoC with what you proposed:
> >     - SchemaCoder
> >     - SchemaRegistry
> >     - @FieldAccess on DoFn
> >     - Select.fields PTransform
> >     ?
> >
> >     If not, I'm volunteer to start the branch and start to sketch.
> >
> >     Thoughts ?
> >
> >     Regards
> >     JB
> >
> >     On 02/04/2018 08:23 PM, Reuven Lax wrote:
> >     > Cool, let's chat about this on slack for a bit (which I realized I've
> >     > been signed out of for some time).
> >     >
> >     > Reuven
> >     >
> >     > On Sun, Feb 4, 2018 at 9:21 AM, Jean-Baptiste Onofré <jb@nanthrax.net> wrote:
> >     >
> >     >     Sorry guys, I was off today. Happy to be part of the party too ;)
> >     >
> >     >     Regards
> >     >     JB
> >     >
> >     >     On 02/04/2018 06:19 PM, Reuven Lax wrote:
> >     >     > Romain, since you're interested maybe the two of us should put together a
> >     >     > proposal for how to set these things (hints, schema) on PCollections? I don't
> >     >     > think it'll be hard - the previous list thread on hints already agreed on a
> >     >     > general approach, and we would just need to flesh it out.
> >     >     >
> >     >     > BTW in the past when I looked, Json schemas seemed to have some odd limitations
> >     >     > inherited from Javascript (e.g. no distinction between integer and
> >     >     > floating-point types). Is that still true?
> >     >     >
> >     >     > Reuven
> >     >     >
> >     >     > On Sun, Feb 4, 2018 at 9:12 AM, Romain Manni-Bucau <rmannibucau@gmail.com> wrote:
> >     >     >
> >     >     >
> >     >     >
> >     >     >     2018-02-04 17:53 GMT+01:00 Reuven Lax <relax@google.com>:
> >     >     >
> >     >     >
> >     >     >
> >     >     >         On Sun, Feb 4, 2018 at 8:42 AM, Romain Manni-Bucau <rmannibucau@gmail.com> wrote:
> >     >     >
> >     >     >
> >     >     >             2018-02-04 17:37 GMT+01:00 Reuven Lax <relax@google.com>:
> >     >     >
> >     >     >                 I'm not sure where proto comes from here. Proto is one example
> >     >     >                 of a type that has a schema, but only one example.
> >     >     >
> >     >     >                 1. In the initial prototype I want to avoid modifying the
> >     >     >                 PCollection API. So I think it's best to create a special
> >     >     >                 SchemaCoder, and pass the schema into this coder. Later we might
> >     >     >                 add targeted APIs for this instead of going through a coder.
> >     >     >                 1.a I don't see what hints have to do with this?
> >     >     >
> >     >     >
> >     >     >             Hints are a way to replace the new API and unify the way to pass
> >     >     >             metadata in Beam instead of adding a new custom way each time.
> >     >     >
> >     >     >
> >     >     >         I don't think schema is a hint. But I hear what you're saying - hint is a
> >     >     >         type of PCollection metadata as is schema, and we should have a unified
> >     >     >         API for setting such metadata.
> >     >     >
> >     >     >
> >     >     >     :), Ismael pointed out to me earlier this week that "hint" had an old meaning
> >     >     >     in Beam. My usage is purely the one found in most EE specs (your "metadata" in
> >     >     >     previous answer). But I guess we are aligned on the meaning now, just wanted
> >     >     >     to be sure.
> >     >     >
> >     >     >
> >     >     >
> >     >     >
> >     >     >
> >     >     >
> >     >     >
> >     >     >                 2. BeamSQL already has a generic record type which fits this use
> >     >     >                 case very well (though we might modify it). However as mentioned
> >     >     >                 in the doc, the user is never forced to use this generic record
> >     >     >                 type.
> >     >     >
> >     >     >
> >     >     >             Well yes and no. A type already exists but 1. it is very strictly
> >     >     >             limited (flat/columns only, which is very little of what big data SQL
> >     >     >             can do) and 2. it must be aligned with the convergence of generic data
> >     >     >             the schema will bring (really read "aligned" as "dropped in favor
> >     >     >             of" - deprecation being a smooth way to do it).
> >     >     >
> >     >     >
> >     >     >         As I said the existing class needs to be modified and extended, and not
> >     >     >         just for this schema use case. It was meant to represent Calcite SQL rows,
> >     >     >         but doesn't quite even do that yet (Calcite supports nested rows).
> >     >     >         However I think it's the right basis to start from.
> >     >     >
> >     >     >
> >     >     >     Agree on the state. The current impl issues I hit (in addition to the nested
> >     >     >     support, which would by itself require a kind of visitor solution) are that
> >     >     >     the record owns the schema and that serialization is handled field by field
> >     >     >     instead of as a whole, which is how it would be handled with a schema IMHO.
> >     >     >
> >     >     >     Concretely what I don't want is to do a PoC which works - they all work,
> >     >     >     right? - and integrate it into Beam without thinking about a global solution
> >     >     >     for this generic record issue and its schema standardization. This is where
> >     >     >     Json(-P) has a lot of value IMHO, but it requires a bit more love than just
> >     >     >     adding schema in the model.
> >     >     >
> >     >     >
> >     >     >
> >     >     >
> >     >     >
> >     >     >             So long story short, the main work of this schema track is not only
> >     >     >             about using schemas in runners and elsewhere but also about starting to
> >     >     >             make Beam consistent with itself, which is probably the most important
> >     >     >             outcome since it is the user-facing side of this work.
> >     >     >
> >     >     >
> >     >     >
> >     >     >                 On Sun, Feb 4, 2018 at 12:22 AM, Romain Manni-Bucau <rmannibucau@gmail.com> wrote:
> >     >     >
> >     >     >                     @Reuven: is the proto only about passing schema or also the
> >     >     >                     generic type?
> >     >     >
> >     >     >                     There are 2.5 topics to solve this issue:
> >     >     >
> >     >     >                     1. How to pass schema
> >     >     >                     1.a. hints?
> >     >     >                     2. What is the generic record type associated with a schema,
> >     >     >                     and how to express a schema relative to it
> >     >     >
> >     >     >                     I would be happy to help on 1.a and 2 somehow if you need.
> >     >     >
> >     >     >                     On Feb 4, 2018, at 03:30, "Reuven Lax" <relax@google.com> wrote:
> >     >     >
> >     >     >                         One more thing. If anyone here has experience with
> >     >     >                         various OSS metadata stores (e.g. Kafka Schema Registry
> >     >     >                         is one example), would you like to collaborate on
> >     >     >                         implementation? I want to make sure that source schemas
> >     >     >                         can be stored in a variety of OSS metadata stores, and
> >     >     >                         be easily pulled into a Beam pipeline.
> >     >     >
> >     >     >                         Reuven
> >     >     >
> >     >     >                         On Sat, Feb 3, 2018 at 6:28 PM, Reuven Lax <relax@google.com> wrote:
> >     >     >
> >     >     >                             Hi all,
> >     >     >
> >     >     >                             If there are no concerns, I would like to start
> >     >     >                             working on a prototype. It's just a prototype, so I
> >     >     >                             don't think it will have the final API (e.g. for the
> >     >     >                             prototype I'm going to avoid changing the API of
> >     >     >                             PCollection, and use a "special" Coder instead).
> >     >     >                             Also even once we go beyond prototype, it will be
> >     >     >                             @Experimental for some time, so the API will not be
> >     >     >                             fixed in stone.
> >     >     >
> >     >     >                             Any more comments on this approach before we start
> >     >     >                             implementing a prototype?
> >     >     >
> >     >     >                             Reuven
> >     >     >
> >     >     >                             On Wed, Jan 31, 2018 at 1:12 PM, Romain Manni-Bucau <rmannibucau@gmail.com> wrote:
> >     >     >
> >     >     >                                 If you need help on the JSON part I'm happy to
> >     >     >                                 help. To give a few hints on what is very
> >     >     >                                 doable: we can add an Avro module to johnzon
> >     >     >                                 (asf json{p,b} impl) to back jsonp by Avro
> >     >     >                                 (I guess it will be one of the first things
> >     >     >                                 asked for), for instance.
> >     >     >
> >     >     >
> >     >     >                                 Romain Manni-Bucau
> >     >     >                                 @rmannibucau <https://twitter.com/rmannibucau> |
> >     >     >                                 Blog <https://rmannibucau.metawerx.net/> |
> >     >     >                                 Old Blog <http://rmannibucau.wordpress.com> |
> >     >     >                                 Github <https://github.com/rmannibucau> |
> >     >     >                                 LinkedIn <https://www.linkedin.com/in/rmannibucau>
> >     >     >
> >     >     >                                 2018-01-31 22:06 GMT+01:00 Reuven Lax <relax@google.com>:
> >     >     >
> >     >     >                                     Agree. The initial implementation will be a
> >     >     >                                     prototype.
> >     >     >
> >     >     >                                     On Wed, Jan 31, 2018 at 12:21 PM, Jean-Baptiste Onofré <jb@nanthrax.net> wrote:
> >     >     >
> >     >     >                                         Hi Reuven,
> >     >     >
> >     >     >                                         Agree on being able to describe the schema in
> >     >     >                                         different formats. The good point about JSON
> >     >     >                                         schemas is that they are described by a spec.
> >     >     >                                         My point is also to avoid reinventing the
> >     >     >                                         wheel. Just an abstraction to be able to use
> >     >     >                                         Avro, JSON, Calcite, or custom schema
> >     >     >                                         descriptors would be great.
> >     >     >
> >     >     >                                         Using a coder to describe a schema sounds like
> >     >     >                                         a smart move to implement quickly. However, it
> >     >     >                                         has to be clear in terms of documentation to
> >     >     >                                         avoid "side effects". I still think
> >     >     >                                         PCollection.setSchema() is better: it should
> >     >     >                                         be metadata (or hint ;))) on the PCollection.
> >     >     >
> >     >     >                                         Regards
> >     >     >                                         JB
> >     >     >
> >     >     >                                         On 31/01/2018 20:16, Reuven Lax wrote:
> >     >     >
> >     >     >                                             As to the question of how a schema should be
> >     >     >                                             specified, I want to support several common
> >     >     >                                             schema formats. So if a user has a Json
> >     >     >                                             schema, or an Avro schema, or a Calcite
> >     >     >                                             schema, etc. there should be adapters that
> >     >     >                                             allow setting a schema from any of them. I
> >     >     >                                             don't think we should prefer one over the
> >     >     >                                             other. While Romain is right that many people
> >     >     >                                             know Json, I think far fewer people know Json
> >     >     >                                             schemas.
> >     >     >
> >     >     >                                             Agree, schemas should not be
> >     >     >                                             enforced (for one thing, that
> >     >     >                                             wouldn't be backwards compatible!).
> >     >     >                                             I think for the initial prototype I
> >     >     >                                             will probably use a special coder to
> >     >     >                                             represent the schema (with setSchema
> >     >     >                                             an option on the coder), largely
> >     >     >                                             because it doesn't require modifying
> >     >     >                                             PCollection. However I think longer
> >     >     >                                             term a schema should be an optional
> >     >     >                                             piece of metadata on the PCollection
> >     >     >                                             object. Similar to the previous
> >     >     >                                             discussion about "hints," I think
> >     >     >                                             this can be set on the producing
> >     >     >                                             PTransform, and a SetSchema
> >     >     >                                             PTransform will allow attaching a
> >     >     >                                             schema to any PCollection (i.e.
> >     >     >                                             pc.apply(SetSchema.of(schema))).
> >     >     >                                             This part isn't designed yet, but I
> >     >     >                                             think schema should be similar to
> >     >     >                                             hints, it's just another piece of
> >     >     >                                             metadata on the PCollection (though
> >     >     >                                             something interpreted by the model,
> >     >     >                                             where hints are interpreted by the
> >     >     >                                             runner)
> >     >     >
> >     >     >                                             Reuven
> >     >     >
> >     >     >                                             On Tue, Jan 30, 2018 at 1:37 AM,
> >     >     >                                             Jean-Baptiste Onofré <jb@nanthrax.net> wrote:
> >     >     >
> >     >     >                                                 Hi,
> >     >     >
> >     >     >                                                 I think we should avoid mixing
> >     >     >                                                 two things in the discussion (and so
> >     >     >                                                 in the document):
> >     >     >
> >     >     >                                                 1. The element of the collection
> >     >     >                                                 and the schema itself are two
> >     >     >                                                 different things.
> >     >     >                                                 By nature, Beam should not
> >     >     >                                                 enforce any schema. That's why I think
> >     >     >                                                 it's a good idea to set the schema
> >     >     >                                                 optionally on the PCollection
> >     >     >                                                 (pcollection.setSchema()).
> >     >     >
> >     >     >                                                 2. Two questions follow from point 1:
> >     >     >                                                 how do we represent a schema? How can
> >     >     >                                                 we leverage the schema to simplify the
> >     >     >                                                 serialization of the elements in the
> >     >     >                                                 PCollection and querying? These two
> >     >     >                                                 questions are not directly related.
> >     >     >
> >     >     >                                                   2.1 How do we represent the schema
> >     >     >                                                 Json Schema is a very interesting
> >     >     >                                                 idea. It could be an abstraction that
> >     >     >                                                 other providers, like Avro, can be
> >     >     >                                                 bound to. It's part of the json
> >     >     >                                                 processing spec (javax).
> >     >     >
> >     >     >                                                   2.2. How do we leverage the schema
> >     >     >                                                 for query and serialization
> >     >     >                                                 Also in the spec, json pointer is
> >     >     >                                                 interesting for querying. Regarding
> >     >     >                                                 serialization, jackson or another
> >     >     >                                                 data binder can be used.
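
A minimal sketch of the JSON Pointer querying idea from point 2.2, assuming a record is modelled as nested java.util.Map and List values rather than JSON-P types. The names PointerSketch, resolve and sampleRecord are hypothetical, and nothing here is Beam or JSON-P API; it only illustrates how an RFC 6901 pointer such as /user/address/city selects one field of a nested record.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative RFC 6901 JSON Pointer lookup over nested Map/List values. */
final class PointerSketch {

    /** Resolves a pointer such as "/user/address/city" against a nested record. */
    static Object resolve(Object root, String pointer) {
        if (pointer.isEmpty()) {
            return root; // the empty pointer refers to the whole document
        }
        if (pointer.charAt(0) != '/') {
            throw new IllegalArgumentException("pointer must start with '/': " + pointer);
        }
        Object current = root;
        // limit -1 keeps trailing empty tokens; tokens[0] is the empty string before the first '/'
        String[] tokens = pointer.split("/", -1);
        for (int i = 1; i < tokens.length; i++) {
            // Unescape per RFC 6901: "~1" first, then "~0".
            String token = tokens[i].replace("~1", "/").replace("~0", "~");
            if (current instanceof Map) {
                Map<?, ?> map = (Map<?, ?>) current;
                if (!map.containsKey(token)) {
                    throw new IllegalArgumentException("no such field: " + token);
                }
                current = map.get(token);
            } else if (current instanceof List) {
                current = ((List<?>) current).get(Integer.parseInt(token));
            } else {
                throw new IllegalArgumentException("cannot descend into a scalar at: " + token);
            }
        }
        return current;
    }

    /** Hypothetical nested record used only for the demonstration. */
    static Map<String, Object> sampleRecord() {
        Map<String, Object> address = new LinkedHashMap<>();
        address.put("city", "Paris");
        Map<String, Object> user = new LinkedHashMap<>();
        user.put("name", "jb");
        user.put("address", address);
        user.put("tags", Arrays.asList("beam", "schema"));
        Map<String, Object> record = new LinkedHashMap<>();
        record.put("user", user);
        return record;
    }

    public static void main(String[] args) {
        Map<String, Object> record = sampleRecord();
        System.out.println(resolve(record, "/user/address/city"));
        System.out.println(resolve(record, "/user/tags/1"));
    }
}
```

A real implementation would more likely delegate to the JsonPointer type that JSON-P 1.1 provides; the hand-rolled walk is only meant to show the selection semantics.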
> >     >     >
> >     >     >                                                 These are still rough ideas in my
> >     >     >                                                 mind, but I like Romain's idea about
> >     >     >                                                 json-p usage.
> >     >     >
> >     >     >                                                 Once the 2.3.0 release is out, I will
> >     >     >                                                 start updating the document with
> >     >     >                                                 those ideas, and a PoC.
> >     >     >
> >     >     >                                                 Thanks !
> >     >     >                                                 Regards
> >     >     >                                                 JB
> >     >     >
> >     >     >                                                 On 01/30/2018 08:42 AM, Romain
> >     >     >                                                 Manni-Bucau wrote:
> >     >     >                                                 >
> >     >     >                                                 >
> >     >     >                                                 > On 30 Jan 2018 01:09,
> >     >     >                                                 > "Reuven Lax" <relax@google.com> wrote:
> >     >     >                                                 >
> >     >     >                                                 >
> >     >     >                                                 >
> >     >     >                                                 >     On Mon, Jan 29, 2018 at 12:17 PM,
> >     >     >                                                 >     Romain Manni-Bucau <rmannibucau@gmail.com> wrote:
> >     >     >                                                  >
> >     >     >                                                  >         Hi
> >     >     >                                                  >
> >     >     >                                                  >         I have some questions on this: how
> >     >     >                                                  >         would hierarchical schemas work? It
> >     >     >                                                  >         seems this is not really supported by
> >     >     >                                                  >         the ecosystem (apart from custom
> >     >     >                                                  >         stuff) :(. How would it integrate
> >     >     >                                                  >         smoothly with other generic record
> >     >     >                                                  >         types - N bridges?
> >     >     >                                                  >
> >     >     >                                                  >
> >     >     >                                                  >     Do you mean nested schemas? What do
> >     >     >                                                  >     you mean here?
> >     >     >                                                  >
> >     >     >                                                  >
> >     >     >                                                  > Yes, sorry - I wrote the mail too late ;).
> >     >     >                                                  > I meant hierarchical data and nested
> >     >     >                                                  > schemas.
> >     >     >                                                  >
> >     >     >                                                  >
> >     >     >                                                  >         Concretely I wonder if using a json
> >     >     >                                                  >         API couldn't be beneficial: json-p is
> >     >     >                                                  >         a nice generic abstraction with a
> >     >     >                                                  >         built-in querying mechanism
> >     >     >                                                  >         (jsonpointer) but no actual
> >     >     >                                                  >         serialization (even if json and
> >     >     >                                                  >         binary json are very natural). The
> >     >     >                                                  >         big advantage is to have a well-known
> >     >     >                                                  >         ecosystem - who doesn't know json
> >     >     >                                                  >         today? - that beam can reuse for
> >     >     >                                                  >         free: JsonObject (guess we don't want
> >     >     >                                                  >         the JsonValue abstraction) for the
> >     >     >                                                  >         record type, the jsonschema standard
> >     >     >                                                  >         for the schema, jsonpointer for the
> >     >     >                                                  >         selection/projection, etc. It doesn't
> >     >     >                                                  >         enforce the actual serialization
> >     >     >                                                  >         (json, smile, avro, ...) but provides
> >     >     >                                                  >         an expressive and already known API,
> >     >     >                                                  >         so I see it as a big win-win for
> >     >     >                                                  >         users (no need to learn a new API and
> >     >     >                                                  >         use N bridges in all directions) and
> >     >     >                                                  >         beam (impls are here and the API
> >     >     >                                                  >         design is already thought through).
> >     >     >                                                  >
> >     >     >                                                  >
> >     >     >                                                  >     I assume you're talking about the API
> >     >     >                                                  >     for setting schemas, not using them.
> >     >     >                                                  >     Json has many downsides and I'm not
> >     >     >                                                  >     sure it's true that everyone knows it;
> >     >     >                                                  >     there are also competing schema APIs,
> >     >     >                                                  >     such as Avro etc. However I think we
> >     >     >                                                  >     should give Json a fair
> >     >     >

Re: Schema-Aware PCollections revisited

Posted by Jean-Baptiste Onofré <jb...@nanthrax.net>.
Cool,

can I work with you on this (sharing a branch for instance) ?

Thanks !
Regards
JB

On 03/05/2018 01:01 PM, Reuven Lax wrote:
> Yes, I do have a PoC in progress. The Beam Row class was being refactored, so I
> paused to wait for that to finish.
> 
> 
> On Sun, Mar 4, 2018 at 8:24 PM Jean-Baptiste Onofré <jb@nanthrax.net> wrote:
> 
>     Hi Reuven,
> 
>     I'm reviving this discussion, as I think it would be a great addition.
> 
>     We had some discussion on the fly, but I think that now, as a basis for
>     discussion, it would be great to have a feature branch where we can start a
>     sketch/implementation and discuss.
> 
>     @Reuven, did you start a PoC with what you proposed:
>     - SchemaCoder
>     - SchemaRegistry
>     - @FieldAccess on DoFn
>     - Select.fields PTransform
>     ?
> 
>     If not, I volunteer to start the branch and begin sketching.
> 
>     Thoughts ?
> 
>     Regards
>     JB
> 
>     On 02/04/2018 08:23 PM, Reuven Lax wrote:
>     > Cool, let's chat about this on slack for a bit (which I realized I've been
>     > signed out of for some time).
>     >
>     > Reuven
>     >
>     > On Sun, Feb 4, 2018 at 9:21 AM, Jean-Baptiste Onofré <jb@nanthrax.net> wrote:
>     >
>     >     Sorry guys, I was off today. Happy to be part of the party too ;)
>     >
>     >     Regards
>     >     JB
>     >
>     >     On 02/04/2018 06:19 PM, Reuven Lax wrote:
>     >     > Romain, since you're interested maybe the two of us should put together a
>     >     > proposal for how to set these things (hints, schema) on PCollections? I don't
>     >     > think it'll be hard - the previous list thread on hints already agreed on a
>     >     > general approach, and we would just need to flesh it out.
>     >     >
>     >     > BTW in the past when I looked, Json schemas seemed to have some odd limitations
>     >     > inherited from Javascript (e.g. no distinction between integer and
>     >     > floating-point types). Is that still true?
>     >     >
>     >     > Reuven
>     >     >
>     >     > On Sun, Feb 4, 2018 at 9:12 AM, Romain Manni-Bucau <rmannibucau@gmail.com> wrote:
>     >     >
>     >     >
>     >     >
>     >     >     2018-02-04 17:53 GMT+01:00 Reuven Lax <relax@google.com>:
>     >     >
>     >     >
>     >     >
>     >     >         On Sun, Feb 4, 2018 at 8:42 AM, Romain Manni-Bucau
>     >     >         <rmannibucau@gmail.com> wrote:
>     >     >
>     >     >
>     >     >             2018-02-04 17:37 GMT+01:00 Reuven Lax <relax@google.com>:
>     >     >
>     >     >                 I'm not sure where proto comes from here. Proto is one example
>     >     >                 of a type that has a schema, but only one example.
>     >     >
>     >     >                 1. In the initial prototype I want to avoid modifying the
>     >     >                 PCollection API. So I think it's best to create a special
>     >     >                 SchemaCoder, and pass the schema into this coder. Later we might
>     >     >                 add targeted APIs for this instead of going through a coder.
>     >     >                 1.a I don't see what hints have to do with this?
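
A rough sketch of what passing a schema into a coder could look like, under loud assumptions: this is plain Java, not Beam's Coder API and not the SchemaCoder from the PoC discussed here. The names SchemaCoderSketch, Field and FieldType are hypothetical. The point is only that the schema travels with the coder, so rows can be written as bare values in field order while the collection type stays untouched.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy schema-carrying coder; hypothetical, not Beam's Coder API. */
final class SchemaCoderSketch {

    /** Hypothetical field types for this sketch. */
    enum FieldType { INT64, DOUBLE, STRING }

    /** Hypothetical field descriptor: a name plus a type. */
    static final class Field {
        final String name;
        final FieldType type;
        Field(String name, FieldType type) { this.name = name; this.type = type; }
    }

    private final List<Field> schema;

    private SchemaCoderSketch(List<Field> schema) { this.schema = schema; }

    /** The schema is handed to the coder, so the collection type itself is left alone. */
    static SchemaCoderSketch of(Field... fields) {
        return new SchemaCoderSketch(Arrays.asList(fields));
    }

    /** Encodes the row values in schema order; field names are not written. */
    byte[] encode(List<Object> row) {
        if (row.size() != schema.size()) {
            throw new IllegalArgumentException("row does not match schema");
        }
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            for (int i = 0; i < schema.size(); i++) {
                switch (schema.get(i).type) {
                    case INT64: out.writeLong((Long) row.get(i)); break;
                    case DOUBLE: out.writeDouble((Double) row.get(i)); break;
                    case STRING: out.writeUTF((String) row.get(i)); break;
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Decodes a row by reading one value per schema field, in order. */
    List<Object> decode(byte[] encoded) {
        List<Object> row = new ArrayList<>();
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(encoded))) {
            for (Field field : schema) {
                switch (field.type) {
                    case INT64: row.add(in.readLong()); break;
                    case DOUBLE: row.add(in.readDouble()); break;
                    case STRING: row.add(in.readUTF()); break;
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return row;
    }

    public static void main(String[] args) {
        SchemaCoderSketch coder = of(
            new Field("id", FieldType.INT64),
            new Field("score", FieldType.DOUBLE),
            new Field("name", FieldType.STRING));
        List<Object> row = Arrays.<Object>asList(42L, 0.5, "beam");
        // Round trip: the decoded row equals the original row.
        System.out.println(coder.decode(coder.encode(row)).equals(row));
    }
}
```

Because the field names and types live in the schema, the encoded form of the row in main is just 8 + 8 + 6 bytes; nothing about the structure is repeated per element.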
>     >     >
>     >     >
>     >     >             Hints are a way to replace the new API and unify the way to pass
>     >     >             metadata in beam instead of adding a new custom way each time.
>     >     >
>     >     >
>     >     >         I don't think schema is a hint. But I hear what you're saying - hint is a
>     >     >         type of PCollection metadata as is schema, and we should have a unified
>     >     >         API for setting such metadata.
>     >     >
>     >     >
>     >     >     :), Ismael pointed out to me earlier this week that "hint" had an old meaning
>     >     >     in beam. My usage is purely the one found in most EE specs (your "metadata" in
>     >     >     the previous answer). But I guess we are aligned on the meaning now, just wanted
>     >     >     to be sure.
>     >     >
>     >     >
>     >     >                 2. BeamSQL already has a generic record type which fits this use
>     >     >                 case very well (though we might modify it). However as mentioned
>     >     >                 in the doc, the user is never forced to use this generic record
>     >     >                 type.
>     >     >
>     >     >
>     >     >             Well, yes and no. A type already exists, but 1. it is very strictly
>     >     >             limited (flat/columns only, which is very little of what big data SQL
>     >     >             can do) and 2. it must be aligned with the convergence on generic data
>     >     >             that the schema will bring (really read "aligned" as "dropped in favor
>     >     >             of" - deprecation being a smooth way to do it).
>     >     >
>     >     >
>     >     >         As I said the existing class needs to be modified and extended, and not
>     >     >         just for this schema use case. It was meant to represent Calcite SQL rows,
>     >     >         but doesn't quite even do that yet (Calcite supports nested rows).
>     >     >         However I think it's the right basis to start from.
>     >     >
>     >     >
>     >     >     Agree on the state. The issues I hit with the current impl (in addition to
>     >     >     nested support, which by itself would require a kind of visitor solution) are
>     >     >     that the record owns the schema and that serialization is handled field by
>     >     >     field instead of as a whole, which is how it would be handled with a
>     >     >     schema IMHO.
>     >     >
>     >     >     Concretely, what I don't want is to do a PoC which works - they all work,
>     >     >     right? - and integrate it into beam without thinking about a global solution
>     >     >     for this generic record issue and its schema standardization. This is where
>     >     >     Json(-P) has a lot of value IMHO, but it requires a bit more love than just
>     >     >     adding schema to the model.
>     >     >      
>     >     >
>     >     >          
>     >     >
>     >     >
>     >     >             So long story short the main work of this schema track is not only
>     >     >             on using schema in runners and other ways but also starting to make
>     >     >             beam consistent with itself which is probably the most important
>     >     >             outcome since it is the user facing side of this work.
>     >     >              
>     >     >
>     >     >
>     >     >                 On Sun, Feb 4, 2018 at 12:22 AM, Romain Manni-Bucau
>     >     >                 <rmannibucau@gmail.com> wrote:
>     >     >
>     >     >                     @Reuven: is the proto only about passing schema or also the
>     >     >                     generic type?
>     >     >
>     >     >                     There are 2.5 topics to solve this issue:
>     >     >
>     >     >                     1. How to pass schema
>     >     >                     1.a. hints?
>     >     >                     2. What is the generic record type associated with a schema
>     >     >                     and how to express a schema relative to it
>     >     >
>     >     >                     I would be happy to help with 1.a and 2 somehow if you need.
>     >     >
>     >     >                     On 4 Feb 2018 03:30, "Reuven Lax" <relax@google.com> wrote:
>     >     >
>     >     >                         One more thing. If anyone here has experience with
>     >     >                         various OSS metadata stores (e.g. Kafka Schema Registry
>     >     >                         is one example), would you like to collaborate on
>     >     >                         implementation? I want to make sure that source schemas
>     >     >                         can be stored in a variety of OSS metadata stores, and
>     >     >                         be easily pulled into a Beam pipeline.
>     >     >
>     >     >                         Reuven
>     >     >
>     >     >                         On Sat, Feb 3, 2018 at 6:28 PM, Reuven Lax
>     >     >                         <relax@google.com> wrote:
>     >     >
>     >     >                             Hi all,
>     >     >
>     >     >                             If there are no concerns, I would like to start
>     >     >                             working on a prototype. It's just a prototype, so I
>     >     >                             don't think it will have the final API (e.g. for the
>     >     >                             prototype I'm going to avoid changing the API of
>     >     >                             PCollection, and use a "special" Coder instead).
>     >     >                             Also even once we go beyond prototype, it will be
>     >     >                             @Experimental for some time, so the API will not be
>     >     >                             set in stone.
>     >     >
>     >     >                             Any more comments on this approach before we start
>     >     >                             implementing a prototype?
>     >     >
>     >     >                             Reuven
>     >     >
>     >     >                             On Wed, Jan 31, 2018 at 1:12 PM, Romain Manni-Bucau
>     >     >                             <rmannibucau@gmail.com> wrote:
>     >     >
>     >     >                                 If you need help on the json part I'm happy to
>     >     >                                 help. To give a few hints on what is very
>     >     >                                 doable: we can add an avro module to johnzon
>     >     >                                 (asf json{p,b} impl) to back jsonp by avro
>     >     >                                 (guess it will be one of the first to be asked)
>     >     >                                 for instance.
>     >     >
>     >     >
>     >     >                                 Romain Manni-Bucau
>     >     >                                 @rmannibucau <https://twitter.com/rmannibucau> |
>     >     >                                 Blog <https://rmannibucau.metawerx.net/> | Old
>     >     >                                 Blog <http://rmannibucau.wordpress.com> | Github
>     >     >                                 <https://github.com/rmannibucau> | LinkedIn
>     >     >                                 <https://www.linkedin.com/in/rmannibucau>
>     >     >
>     >     >                                 2018-01-31 22:06 GMT+01:00 Reuven Lax
>     >     >                                 <relax@google.com>:
>     >     >
>     >     >                                     Agree. The initial implementation will be a
>     >     >                                     prototype.
>     >     >
>     >     >                                     On Wed, Jan 31, 2018 at 12:21 PM,
>     >     >                                     Jean-Baptiste Onofré <jb@nanthrax.net> wrote:
>     >     >
>     >     >                                         Hi Reuven,
>     >     >
>     >     >                                         I agree we should be able to describe the schema
>     >     >                                         in different formats. The good point about json
>     >     >                                         schemas is that they are described by a spec. My
>     >     >                                         point is also to avoid reinventing the wheel.
>     >     >                                         Just an abstraction that lets us use Avro, Json,
>     >     >                                         Calcite, or custom schema descriptors would be
>     >     >                                         great.
>     >     >
>     >     >                                         Using a coder to describe a schema sounds like a
>     >     >                                         smart move to implement quickly. However, it has
>     >     >                                         to be clear in terms of documentation to avoid
>     >     >                                         "side effects". I still think
>     >     >                                         PCollection.setSchema() is better: it should be
>     >     >                                         metadata (or a hint ;)) on the PCollection.
>     >     >
>     >     >                                         Regards
>     >     >                                         JB
>     >     >
>     >     >                                         On 31/01/2018 20:16, Reuven Lax wrote:
>     >     >
>     >     >                                             As to the question of how a schema should be
>     >     >                                             specified, I want to support several common
>     >     >                                             schema formats. So if a user has a Json
>     >     >                                             schema, or an Avro schema, or a Calcite
>     >     >                                             schema, etc. there should be adapters that
>     >     >                                             allow setting a schema from any of them. I
>     >     >                                             don't think we should prefer one over the
>     >     >                                             other. While Romain is right that many people
>     >     >                                             know Json, I think far fewer people know Json
>     >     >                                             schemas.
>     >     >
>     >     >                                             Agree, schemas should not be enforced (for
>     >     >                                             one thing, that wouldn't be backwards
>     >     >                                             compatible!). I think for the initial
>     >     >                                             prototype I will probably use a special coder
>     >     >                                             to represent the schema (with setSchema an
>     >     >                                             option on the coder), largely because it
>     >     >                                             doesn't require modifying PCollection.
>     >     >                                             However I think longer term a schema should
>     >     >                                             be an optional piece of metadata on the
>     >     >                                             PCollection object. Similar to the previous
>     >     >                                             discussion about "hints," I think this can be
>     >     >                                             set on the producing PTransform, and a
>     >     >                                             SetSchema PTransform will allow attaching a
>     >     >                                             schema to any PCollection (i.e.
>     >     >                                             pc.apply(SetSchema.of(schema))). This part
>     >     >                                             isn't designed yet, but I think schema should
>     >     >                                             be similar to hints, it's just another piece
>     >     >                                             of metadata on the PCollection (though
>     >     >                                             something interpreted by the model, where
>     >     >                                             hints are interpreted by the runner)
>     >     >
>     >     >                                             Reuven
>     >     >
>     >     >                                             On Tue, Jan 30, 2018 at 1:37 AM,
>     >     >                                             Jean-Baptiste Onofré <jb@nanthrax.net> wrote:
>     >     >
>     >     >                                                 Hi,
>     >     >
>     >     >                                                 I think we should avoid mixing two things in
>     >     >                                                 the discussion (and so the document):
>     >     >
>     >     >                                                 1. The element of the collection and the
>     >     >                                                 schema itself are two different things. By
>     >     >                                                 essence, Beam should not enforce any schema.
>     >     >                                                 That's why I think it's a good idea to set
>     >     >                                                 the schema optionally on the PCollection
>     >     >                                                 (pcollection.setSchema()).
>     >     >
>     >     >                                                 2. From point 1 come two questions: how do
>     >     >                                                 we represent a schema? How can we leverage
>     >     >                                                 the schema to simplify the serialization of
>     >     >                                                 the element in the PCollection and query?
>     >     >                                                 These two questions are not directly related.
>     >     >
>     >     >                                                   2.1 How do we represent the schema
>     >     >                                                 Json Schema is a very interesting idea. It
>     >     >                                                 could be an abstraction, and other providers,
>     >     >                                                 like Avro, can be bound to it. It's part of
>     >     >                                                 the json processing spec (javax).
>     >     >
>     >     >                                                   2.2. How do we leverage the schema for
>     >     >                                                   query and serialization
>     >     >                                                 Also in the spec, json pointer is interesting
>     >     >                                                 for the querying. Regarding the
>     >     >                                                 serialization, jackson or another data binder
>     >     >                                                 can be used.
>     >     >
>     >     >                                                 These are still rough ideas in my mind, but I
>     >     >                                                 like Romain's idea about json-p usage.
>     >     >
>     >     >                                                 Once the 2.3.0 release is out, I will start
>     >     >                                                 to update the document with those ideas, and
>     >     >                                                 a PoC.
>     >     >
>     >     >                                                 Thanks !
>     >     >                                                 Regards
>     >     >                                                 JB
>     >     >
>     >     >                                                 On 01/30/2018 08:42 AM, Romain Manni-Bucau wrote:
>     >     >                                                 >
>     >     >                                                 >
>     >     >                                                 > On Jan 30, 2018 at 01:09, "Reuven Lax" <relax@google.com> wrote:
>     >     >                                                 >
>     >     >                                                 >
>     >     >                                                 >
>     >     >                                                 >     On Mon, Jan 29, 2018 at 12:17 PM, Romain Manni-Bucau <rmannibucau@gmail.com> wrote:
>     >     >                                                  >
>     >     >                                                  >         Hi
>     >     >                                                  >
>     >     >                                                  >         I have some questions on this: how would
>     >     >                                                  >         hierarchical schemas work? It seems they are
>     >     >                                                  >         not really supported by the ecosystem
>     >     >                                                  >         (outside of custom stuff) :(. How would this
>     >     >                                                  >         integrate smoothly with other generic record
>     >     >                                                  >         types - N bridges?
>     >     >                                                  >
>     >     >                                                  >
>     >     >                                                  >     Do you mean nested schemas? What do you mean here?
>     >     >                                                  >
>     >     >                                                  >
>     >     >                                                  > Yes, sorry - I wrote the mail too late ;). I meant
>     >     >                                                  > hierarchical data and nested schemas.
>     >     >                                                  >
>     >     >                                                  >
>     >     >                                                  >         Concretely I wonder if using the json API
>     >     >                                                  >         couldn't be beneficial: json-p is a nice
>     >     >                                                  >         generic abstraction with a built-in querying
>     >     >                                                  >         mechanism (jsonpointer) but no actual
>     >     >                                                  >         serialization (even if json and binary json
>     >     >                                                  >         are very natural). The big advantage is to
>     >     >                                                  >         have a well-known ecosystem - who doesn't
>     >     >                                                  >         know json today? - that beam can reuse for
>     >     >                                                  >         free: JsonObject (guess we don't want the
>     >     >                                                  >         JsonValue abstraction) for the record type,
>     >     >                                                  >         the jsonschema standard for the schema,
>     >     >                                                  >         jsonpointer for the selection/projection,
>     >     >                                                  >         etc... It doesn't enforce the actual
>     >     >                                                  >         serialization (json, smile, avro, ...) but
>     >     >                                                  >         provides an expressive and already known API,
>     >     >                                                  >         so I see it as a big win-win for users (no
>     >     >                                                  >         need to learn a new API and use N bridges in
>     >     >                                                  >         all directions) and for beam (impls are here
>     >     >                                                  >         and the API design is already thought out).
>     >     >                                                  >
>     >     >                                                  >
>     >     >                                                  >     I assume you're talking about the API for
>     >     >                                                  >     setting schemas, not using them. Json has many
>     >     >                                                  >     downsides and I'm not sure it's true that
>     >     >                                                  >     everyone knows it; there are also competing
>     >     >                                                  >     schema APIs, such as Avro etc. However I think
>     >     >                                                  >     we should give Json a fair evaluation before
>     >     >                                                  >     dismissing it.
>     >     >                                                  >
>     >     >                                                  >
>     >     >                                                  > It is a wider topic than schemas. Actually schemas
>     >     >                                                  > are not the first-class citizen; a generic data
>     >     >                                                  > representation is. That is where json beats almost
>     >     >                                                  > any other API. Then, when it comes to schemas, json
>     >     >                                                  > has a standard for that so we are all good.
>     >     >                                                  >
>     >     >                                                  > Also json has a good indexing API compared to
>     >     >                                                  > alternatives which are sometimes a bit faster - for
>     >     >                                                  > noop transforms - but are hardly usable or make the
>     >     >                                                  > code not that readable.
>     >     >                                                  >
>     >     >                                                  > Avro is a nice competitor and it is compatible -
>     >     >                                                  > actually avro is json driven by design - but its API
>     >     >                                                  > is far from being that easy due to its schema
>     >     >                                                  > enforcement, which is heavvvyyy, and worse, you
>     >     >                                                  > can't work with avro without a schema. Json would
>     >     >                                                  > allow us to reconcile the dynamic and static cases
>     >     >                                                  > since the job wouldn't change except for the
>     >     >                                                  > setSchema call.
>     >     >                                                  >
>     >     >                                                  > That is why I think json is a good compromise, and
>     >     >                                                  > having a standard API for it allows fully
>     >     >                                                  > customizing the impl at will if needed - even using
>     >     >                                                  > avro or protobuf.
>     >     >                                                  >
>     >     >                                                  > Side note on the beam api: I don't think it is good
>     >     >                                                  > to use a main API for runner optimization. It
>     >     >                                                  > enforces something to be shared by all runners but
>     >     >                                                  > not widely usable. It is also misleading for users.
>     >     >                                                  > Would you set a flink pipeline option with dataflow?
>     >     >                                                  > My proposal here is to use hints - properties -
>     >     >                                                  > instead of something hard-wired in the API, then
>     >     >                                                  > standardize it if all runners support it.
>     >     >                                                  >
>     >     >                                                  >
>     >     >                                                  >
>     >     >                                                  >         Wdyt?
>     >     >                                                  >
>     >     >                                                  >         On Jan 29, 2018 at 06:24, "Jean-Baptiste Onofré" <jb@nanthrax.net> wrote:
>     >     >
>     >     >                                                  >
>     >     >                                                  >             Hi Reuven,
>     >     >                                                  >
>     >     >                                                  >             Thanks for the update! As I'm working with
>     >     >                                                  >             you on this, I fully agree, and it's a great
>     >     >                                                  >             doc gathering the ideas.
>     >     >                                                  >
>     >     >                                                  >             It's clearly something we have to add asap
>     >     >                                                  >             in Beam, because it would allow new use
>     >     >                                                  >             cases for our users (in a simple way) and
>     >     >                                                  >             open new areas for the runners (for instance
>     >     >                                                  >             dataframe support in the Spark runner).
>     >     >                                                  >
>     >     >                                                  >             By the way, a while ago, I created BEAM-3437 to
> 
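
The quoted discussion above mentions JSON Pointer as a ready-made way to select a value inside nested (hierarchical) data. As a rough illustration only, the sketch below resolves an RFC 6901 pointer against plain java.util Maps and Lists so that it needs no dependencies. PointerSketch and resolve are hypothetical names, not part of JSON-P or Beam; JSON-P 1.1 ships its own javax.json.JsonPointer, which this sketch does not use.

```java
import java.util.List;
import java.util.Map;

/** Hypothetical helper: resolves an RFC 6901 JSON Pointer against nested Maps and Lists. */
public final class PointerSketch {
  private PointerSketch() {}

  /** Returns the value addressed by the pointer, or throws if a segment cannot be resolved. */
  public static Object resolve(Object root, String pointer) {
    if (pointer.isEmpty()) {
      return root; // the empty pointer addresses the whole document
    }
    if (pointer.charAt(0) != '/') {
      throw new IllegalArgumentException("pointer must be empty or start with '/': " + pointer);
    }
    Object current = root;
    // Limit -1 keeps trailing empty segments, which are legal reference tokens.
    for (String raw : pointer.substring(1).split("/", -1)) {
      // Unescape "~1" before "~0", as RFC 6901 requires.
      String token = raw.replace("~1", "/").replace("~0", "~");
      if (current instanceof Map) {
        Map<?, ?> map = (Map<?, ?>) current;
        if (!map.containsKey(token)) {
          throw new IllegalArgumentException("no field '" + token + "'");
        }
        current = map.get(token);
      } else if (current instanceof List) {
        // Array segments are treated as zero-based indices.
        current = ((List<?>) current).get(Integer.parseInt(token));
      } else {
        throw new IllegalArgumentException("cannot descend into a scalar at '" + token + "'");
      }
    }
    return current;
  }
}
```

With a record such as {"address": {"city": "Paris"}, "tags": ["a", "b"]}, the pointer /address/city addresses the nested city value and /tags/1 the second tag, which is the kind of nested-field selection the thread is asking about.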

-- 
Jean-Baptiste Onofré
jbonofre@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com

Re: Schema-Aware PCollections revisited

Posted by Reuven Lax <re...@google.com>.
Yes, I do have a PoC in progress. The Beam Row class was being refactored,
so I paused to wait for that to finish.
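
For readers new to the thread, the idea behind the pieces discussed here (a schema describing the fields of each element, and a Select.fields-style projection) can be pictured with a small stand-alone sketch. It uses no Beam classes: Schema, Row and select below are hypothetical stand-ins written for illustration, and they are not the PoC's API. Field types are left out to keep it short.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical stand-ins for illustration; none of these types are Beam classes. */
public final class SchemaSketch {
  private SchemaSketch() {}

  /** An ordered list of field names. */
  public static final class Schema {
    public final List<String> fieldNames;

    public Schema(String... fieldNames) {
      this.fieldNames = Arrays.asList(fieldNames);
    }

    /** Returns the position of a field, or throws if the schema does not declare it. */
    public int indexOf(String name) {
      int index = fieldNames.indexOf(name);
      if (index < 0) {
        throw new IllegalArgumentException("unknown field: " + name);
      }
      return index;
    }
  }

  /** A record whose values line up positionally with its schema. */
  public static final class Row {
    public final Schema schema;
    public final List<Object> values;

    public Row(Schema schema, Object... values) {
      if (values.length != schema.fieldNames.size()) {
        throw new IllegalArgumentException("value count does not match schema");
      }
      this.schema = schema;
      this.values = Arrays.asList(values);
    }

    public Object get(String name) {
      return values.get(schema.indexOf(name));
    }
  }

  /** Projects a row onto the named fields, yielding a row with a narrower schema. */
  public static Row select(Row row, String... names) {
    List<Object> projected = new ArrayList<>();
    for (String name : names) {
      projected.add(row.get(name));
    }
    return new Row(new Schema(names), projected.toArray());
  }
}
```

Because the schema travels with the row rather than being baked into the element type, a generic transform can project fields by name without knowing the user's class, which is the motivation given in the thread for treating the schema as metadata.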


On Sun, Mar 4, 2018 at 8:24 PM Jean-Baptiste Onofré <jb...@nanthrax.net> wrote:

> Hi Reuven,
>
> I revive this discussion as I think it would be a great addition.
>
> We had discussion on the fly, but I think now, as base for discussion, it
> would be great to have a feature branch where we can start some sketch/impl
> and discuss.
>
> @Reuven, did you start a PoC with what you proposed:
> - SchemaCoder
> - SchemaRegistry
> - @FieldAccess on DoFn
> - Select.fields PTransform
> ?
>
> If not, I'm volunteer to start the branch and start to sketch.
>
> Thoughts ?
>
> Regards
> JB
>
> On 02/04/2018 08:23 PM, Reuven Lax wrote:
> > Cool, let's chat about this on slack for a bit (which I realized I've been
> > signed out of for some time).
> >
> > Reuven
> >
> > On Sun, Feb 4, 2018 at 9:21 AM, Jean-Baptiste Onofré <jb@nanthrax.net> wrote:
> >
> >     Sorry guys, I was off today. Happy to be part of the party too ;)
> >
> >     Regards
> >     JB
> >
> >     On 02/04/2018 06:19 PM, Reuven Lax wrote:
> >     > Romain, since you're interested maybe the two of us should put together a
> >     > proposal for how to set these things (hints, schema) on PCollections? I don't
> >     > think it'll be hard - the previous list thread on hints already agreed on a
> >     > general approach, and we would just need to flesh it out.
> >     >
> >     > BTW in the past when I looked, Json schemas seemed to have some odd limitations
> >     > inherited from Javascript (e.g. no distinction between integer and
> >     > floating-point types). Is that still true?
> >     >
> >     > Reuven
> >     >
> >     > On Sun, Feb 4, 2018 at 9:12 AM, Romain Manni-Bucau <rmannibucau@gmail.com> wrote:
> >     >
> >     >
> >     >
> >     >     2018-02-04 17:53 GMT+01:00 Reuven Lax <relax@google.com>:
> >     >
> >     >
> >     >
> >     >         On Sun, Feb 4, 2018 at 8:42 AM, Romain Manni-Bucau <rmannibucau@gmail.com> wrote:
> >     >
> >     >
> >     >             2018-02-04 17:37 GMT+01:00 Reuven Lax <relax@google.com>:
> >     >
> >     >                 I'm not sure where proto comes from here. Proto is one example
> >     >                 of a type that has a schema, but only one example.
> >     >
> >     >                 1. In the initial prototype I want to avoid modifying the
> >     >                 PCollection API. So I think it's best to create a special
> >     >                 SchemaCoder, and pass the schema into this coder. Later we might
> >     >                 add targeted APIs for this instead of going through a coder.
> >     >                 1.a I don't see what hints have to do with this?
> >     >
> >     >
> >     >             Hints are a way to replace the new API and unify the way to pass
> >     >             metadata in Beam instead of adding a new custom way each time.
> >     >
> >     >
> >     >         I don't think schema is a hint. But I hear what you're saying - hint is a
> >     >         type of PCollection metadata, as is schema, and we should have a unified
> >     >         API for setting such metadata.
> >     >
> >     >
> >     >     :), Ismael pointed out to me earlier this week that "hint" had an old meaning
> >     >     in Beam. My usage is purely the one found in most EE specs (your "metadata" in
> >     >     the previous answer). But I guess we are aligned on the meaning now, I just
> >     >     wanted to be sure.
> >     >
> >     >
> >     >
> >     >
> >     >
> >     >
> >     >
> >     >                 2. BeamSQL already has a generic record type which fits this use
> >     >                 case very well (though we might modify it). However, as mentioned
> >     >                 in the doc, the user is never forced to use this generic record
> >     >                 type.
> >     >
> >     >
> >     >             Well, yes and no. A type already exists, but 1. it is very strictly
> >     >             limited (flat/columns only, which is very little of what big data SQL
> >     >             can do) and 2. it must be aligned with the convergence of generic data
> >     >             that the schema will bring (really read "aligned" as "dropped in favor
> >     >             of" - deprecation being a smooth way to do it).
> >     >
> >     >
> >     >         As I said, the existing class needs to be modified and extended, and not
> >     >         just for this schema use case. It was meant to represent Calcite SQL rows,
> >     >         but doesn't quite even do that yet (Calcite supports nested rows).
> >     >         However I think it's the right basis to start from.
> >     >
> >     >
> >     >     Agreed on the state. The current impl issues I hit (in addition to nested
> >     >     support, which would by itself require a kind of visitor solution) are that
> >     >     the record owns the schema and that serialization is handled field by field
> >     >     instead of as a whole, which is how it would be handled with a schema IMHO.
> >     >
> >     >     Concretely, what I don't want is to do a PoC which works - they all work,
> >     >     right? - and integrate it into Beam without thinking about a global solution
> >     >     for this generic record issue and its schema standardization. This is where
> >     >     Json(-P) has a lot of value IMHO, but it requires a bit more love than just
> >     >     adding schema to the model.
> >     >
> >     >
> >     >
> >     >
> >     >
> >     >             So, long story short, the main work of this schema track is not only
> >     >             about using schema in runners and elsewhere but also about starting to
> >     >             make Beam consistent with itself, which is probably the most important
> >     >             outcome since it is the user-facing side of this work.
> >     >
> >     >
> >     >
> >     >                 On Sun, Feb 4, 2018 at 12:22 AM, Romain Manni-Bucau
> >     >                 <rmannibucau@gmail.com> wrote:
> >     >
> >     >                     @Reuven: is the proto only about passing the schema or also the
> >     >                     generic type?
> >     >
> >     >                     There are 2.5 topics to solve for this issue:
> >     >
> >     >                     1. How to pass the schema
> >     >                     1.a. hints?
> >     >                     2. What is the generic record type associated with a schema
> >     >                     and how to express a schema relative to it
> >     >
> >     >                     I would be happy to help on 1.a and 2 somehow if you need it.
> >     >
> >     >                     On Feb 4, 2018 at 03:30, "Reuven Lax" <relax@google.com> wrote:
> >     >
> >     >                         One more thing. If anyone here has experience with various
> >     >                         OSS metadata stores (e.g. Kafka Schema Registry is one
> >     >                         example), would you like to collaborate on implementation?
> >     >                         I want to make sure that source schemas can be stored in a
> >     >                         variety of OSS metadata stores, and be easily pulled into a
> >     >                         Beam pipeline.
> >     >
> >     >                         Reuven
> >     >
> >     >                         On Sat, Feb 3, 2018 at 6:28 PM, Reuven Lax
> >     >                         <relax@google.com> wrote:
> >     >
> >     >                             Hi all,
> >     >
> >     >                             If there are no concerns, I would like to start working
> >     >                             on a prototype. It's just a prototype, so I don't think
> >     >                             it will have the final API (e.g. for the prototype I'm
> >     >                             going to avoid changing the API of PCollection, and use
> >     >                             a "special" Coder instead). Also even once we go beyond
> >     >                             prototype, it will be @Experimental for some time, so
> >     >                             the API will not be set in stone.
> >     >
> >     >                             Any more comments on this approach before we start
> >     >                             implementing a prototype?
> >     >
> >     >                             Reuven
> >     >
> >     >                             On Wed, Jan 31, 2018 at 1:12 PM, Romain Manni-Bucau
> >     >                             <rmannibucau@gmail.com> wrote:
> >     >
> >     >                                 If you need help on the JSON part, I'm happy to help.
> >     >                                 To give a few hints on what is very doable: we can
> >     >                                 add an Avro module to Johnzon (ASF json{p,b} impl) to
> >     >                                 back JSON-P by Avro (I guess it will be one of the
> >     >                                 first things to be asked for), for instance.
> >     >
> >     >
> >     >                                 Romain Manni-Bucau
> >     >                                 @rmannibucau <https://twitter.com/rmannibucau> |
> >     >                                 Blog <https://rmannibucau.metawerx.net/> |
> >     >                                 Old Blog <http://rmannibucau.wordpress.com> |
> >     >                                 Github <https://github.com/rmannibucau> |
> >     >                                 LinkedIn <https://www.linkedin.com/in/rmannibucau>
> >     >
> >     >                                 2018-01-31 22:06 GMT+01:00 Reuven Lax <relax@google.com>:
> >     >
> >     >                                     Agree. The initial implementation will be a prototype.
> >     >
> >     >                                     On Wed, Jan 31, 2018 at 12:21 PM, Jean-Baptiste Onofré
> >     >                                     <jb@nanthrax.net> wrote:
> >     >
> >     >                                         Hi Reuven,
> >     >
> >     >                                         I agree that we should be able to describe the schema
> >     >                                         in different formats. The good point about JSON
> >     >                                         schemas is that they are described by a spec. My
> >     >                                         point is also to avoid reinventing the wheel. Just an
> >     >                                         abstraction to be able to use Avro, JSON, Calcite, or
> >     >                                         custom schema descriptors would be great.
> >     >
> >     >                                         Using a coder to describe a schema sounds like a smart
> >     >                                         move to implement this quickly. However, it has to be
> >     >                                         clear in terms of documentation to avoid "side
> >     >                                         effects". I still think PCollection.setSchema() is
> >     >                                         better: it should be metadata (or hint ;))) on the
> >     >                                         PCollection.
> >     >
> >     >                                         Regards
> >     >                                         JB
> >     >
> >     >                                         On 31/01/2018 20:16, Reuven Lax wrote:
> >     >
> >     >                                             As to the question of how a schema should be
> >     >                                             specified, I want to support several common schema
> >     >                                             formats. So if a user has a Json schema, or an Avro
> >     >                                             schema, or a Calcite schema, etc. there should be
> >     >                                             adapters that allow setting a schema from any of
> >     >                                             them. I don't think we should prefer one over the
> >     >                                             other. While Romain is right that many people know
> >     >                                             Json, I think far fewer people know Json schemas.
> >     >
> >     >                                             Agree, schemas should not be enforced (for one
> >     >                                             thing, that wouldn't be backwards compatible!). I
> >     >                                             think for the initial prototype I will probably use
> >     >                                             a special coder to represent the schema (with
> >     >                                             setSchema an option on the coder), largely because
> >     >                                             it doesn't require modifying PCollection. However I
> >     >                                             think longer term a schema should be an optional
> >     >                                             piece of metadata on the PCollection object. Similar
> >     >                                             to the previous discussion about "hints," I think
> >     >                                             this can be set on the producing PTransform, and a
> >     >                                             SetSchema PTransform will allow attaching a schema
> >     >                                             to any PCollection (i.e.
> >     >                                             pc.apply(SetSchema.of(schema))). This part isn't
> >     >                                             designed yet, but I think schema should be similar
> >     >                                             to hints, it's just another piece of metadata on the
> >     >                                             PCollection (though something interpreted by the
> >     >                                             model, where hints are interpreted by the runner)
> >     >
> >     >                                             Reuven
> >     >
> >     >                                             On Tue, Jan 30, 2018 at 1:37 AM, Jean-Baptiste Onofré
> >     >                                             <jb@nanthrax.net> wrote:
> >     >
> >     >                                                 Hi,
> >     >
> >     >                                                 I think we should avoid mixing two things in the
> >     >                                                 discussion (and so the document):
> >     >
> >     >                                                 1. The element of the collection and the schema
> >     >                                                 itself are two different things.
> >     >                                                 By essence, Beam should not enforce any schema.
> >     >                                                 That's why I think it's a good idea to set the
> >     >                                                 schema optionally on the PCollection
> >     >                                                 (pcollection.setSchema()).
> >     >
> >     >                                                 2. From point 1 come two questions: how do we
> >     >                                                 represent a schema? How can we leverage the
> >     >                                                 schema to simplify the serialization of the
> >     >                                                 element in the PCollection and querying? These
> >     >                                                 two questions are not directly related.
> >     >
> >     >                                                   2.1 How do we represent the schema
> >     >                                                 Json Schema is a very interesting idea. It could
> >     >                                                 be an abstraction, and other providers, like
> >     >                                                 Avro, can be bound to it. It's part of the json
> >     >                                                 processing spec (javax).
> >     >
> >     >                                                   2.2. How do we leverage the schema for query
> >     >                                                   and serialization
> >     >                                                 Also in the spec, json pointer is interesting
> >     >                                                 for the querying. Regarding the serialization,
> >     >                                                 jackson or another data binder can be used.
> >     >
> >     >                                                 These are still rough ideas in my mind, but I
> >     >                                                 like Romain's idea about json-p usage.
> >     >
> >     >                                                 Once the 2.3.0 release is out, I will start to
> >     >                                                 update the document with those ideas, and a PoC.
> >     >
> >     >                                                 Thanks !
> >     >                                                 Regards
> >     >                                                 JB
> >     >
> >     >                                                 On 01/30/2018 08:42 AM, Romain Manni-Bucau wrote:
> >     >                                                 >
> >     >                                                 >
> >     >                                                 > On Jan 30, 2018 at 01:09, "Reuven Lax" <relax@google.com> wrote:
> >     >                                                 >
> >     >                                                 >
> >     >                                                 >
> >     >                                                 >     On Mon, Jan 29, 2018 at 12:17 PM, Romain Manni-Bucau
> >     >                                                 >     <rmannibucau@gmail.com> wrote:
> >     >                                                  >
> >     >                                                  >         Hi
> >     >                                                  >
> >     >                                                  >         I have some questions on this: how would
> >     >                                                  >         hierarchical schemas work? It seems they are
> >     >                                                  >         not really supported by the ecosystem
> >     >                                                  >         (outside of custom stuff) :(. How would it
> >     >                                                  >         integrate smoothly with other generic record
> >     >                                                  >         types - N bridges?
> >     >                                                  >
> >     >                                                  >
> >     >                                                  >     Do you mean nested schemas? What do you mean here?
> >     >                                                  >
> >     >                                                  >
> >     >                                                  > Yes, sorry - I wrote the mail too late ;). I meant
> >     >                                                  > hierarchical data and nested schemas.
> >     >                                                  >
> >     >                                                  >
> >     >                                                  >         Concretely, I wonder if using the JSON API
> >     >                                                  >         couldn't be beneficial: json-p is a nice
> >     >                                                  >         generic abstraction with a built-in querying
> >     >                                                  >         mechanism (jsonpointer) but no actual
> >     >                                                  >         serialization (even if json and binary json
> >     >                                                  >         are very natural). The big advantage is to
> >     >                                                  >         have a well-known ecosystem - who doesn't
> >     >                                                  >         know json today? - that Beam can reuse for
> >     >                                                  >         free: JsonObject (I guess we don't want the
> >     >                                                  >         JsonValue abstraction) for the record type,
> >     >                                                  >         the jsonschema standard for the schema,
> >     >                                                  >         jsonpointer for the selection/projection,
> >     >                                                  >         etc... It doesn't enforce the actual
> >     >                                                  >         serialization (json, smile, avro, ...) but
> >     >                                                  >         provides an expressive and already known
> >     >                                                  >         API, so I see it as a big win-win for users
> >     >                                                  >         (no need to learn a new API and use N
> >     >                                                  >         bridges in all directions) and Beam (impls
> >     >                                                  >         are here and the API design is already
> >     >                                                  >         thought through).
> >     >                                                  >
> >     >                                                  >
> >     >                                                  >     I assume you're talking about the API for
> >     >                                                  >     setting schemas, not using them. Json has many
> >     >                                                  >     downsides and I'm not sure it's true that
> >     >                                                  >     everyone knows it; there are also competing
> >     >                                                  >     schema APIs, such as Avro etc. However I think
> >     >                                                  >     we should give Json a fair evaluation before
> >     >                                                  >     dismissing it.
> >     >                                                  >
> >     >                                                  >
> >     >                                                  > It is a wider topic than schema. Actually, schemas are
> >     >                                                  > not the first-class citizen, but a generic data
> >     >                                                  > representation is. That is where json beats almost any
> >     >                                                  > other API. Then, when it comes to schema, json has a
> >     >                                                  > standard for that, so we are all good.
> >     >                                                  >
> >     >                                                  > Also, json has a good indexing API compared to
> >     >                                                  > alternatives, which are sometimes a bit faster - for
> >     >                                                  > noop transforms - but are hardly usable or make the
> >     >                                                  > code not that readable.
> >     >                                                  >
> >     >                                                  > Avro is a nice competitor, and it is compatible -
> >     >                                                  > actually avro is json driven by design - but its API
> >     >                                                  > is far from being that easy due to its schema
> >     >                                                  > enforcement, which is heavvvyyy, and worse, you can't
> >     >                                                  > work with avro without a schema. Json would allow us
> >     >                                                  > to reconcile the dynamic and static cases, since the
> >     >                                                  > job wouldn't change except for the setSchema.
> >     >                                                  >
> >     >                                                  > That is why I think json is a good compromise, and
> >     >                                                  > having a standard API for it allows fully customizing
> >     >                                                  > the impl at will if needed - even using avro or
> >     >                                                  > protobuf.
> >     >                                                  >
> >     >                                                  > Side note on the Beam API: I don't think it is good to
> >     >                                                  > use a main API for runner optimization. It forces
> >     >                                                  > something to be shared across all runners without
> >     >                                                  > being widely usable. It is also misleading for users.
> >     >                                                  > Would you set a flink pipeline option with dataflow?
> >     >                                                  > My proposal here is to use hints - properties -
> >     >                                                  > instead of something rigidly defined in the API, then
> >     >                                                  > standardize it if all runners support it.
> >     >                                                  >
> >     >                                                  >
> >     >                                                  >
> >     >                                                  >         Wdyt?
> >     >                                                  >
> >     >                                                  >         On Jan 29, 2018, 06:24, "Jean-Baptiste Onofré"
> >     >                                                  >         <jb@nanthrax.net> wrote:
> >     >
> >     >                                                  >
> >     >                                                  >             Hi Reuven,
> >     >                                                  >
> >     >                                                  >             Thanks for the update! As I'm working with you on this,
> >     >                                                  >             I fully agree, and great doc gathering the ideas.
> >     >                                                  >
> >     >                                                  >             It's clearly something we have to add ASAP in Beam,
> >     >                                                  >             because it would allow new use cases for our users (in
> >     >                                                  >             a simple way) and open new areas for the runners (for
> >     >                                                  >             instance, dataframe support in the Spark runner).
> >     >                                                  >
> >     >                                                  >             By the way, a while ago, I created BEAM-3437 to