Posted to dev@drill.apache.org by Camuel Gilyadov <ca...@gmail.com> on 2012/10/14 21:00:01 UTC

schema language for Drill

We need to settle on a schema language for the project. I know there is
a lot of good will to support many schema languages and data formats, and
also the schemaless use case. No problem with that. But let's go modular by
creating many compact single-purpose components which can then be
connected in various combinations to produce many different useful systems,
each optimized for a different scenario. Not all components need to be used
in every combination, of course. However, I suggest first producing a
complete path optimized for one scenario, and only then extending it by
adding new components.

So in this context, let's settle on one schema language and one data
format. I strongly oppose starting from the "schemaless" use case, like
drilling a loose set of JSON documents. The reason is that "schemaless"
datasets in fact contain far too much schema information within the dataset
itself; worse, it is a partial schema, with the remaining parts loosely
scattered across parsers, often in ugly hard-coded imperative form. This
runs directly against the Dremel approach, which goes so far as to encode
all data into columnar form and then compress it in such a way that
predicate evaluation can be done before decompression, and so on. We can
later add the much-needed "drilling a loose pile of JSON documents" use
case, but since it is not the typical use case, let's not start the project
with it. The irony here is that a truly schemaless dataset is one whose
schema is supplied separately, and which can therefore afford to contain
zero schema information within the dataset proper.
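
To make this concrete, consider a hypothetical pair of records: every
document in a "schemaless" pile repeats its field names and encodes its
types inline, record after record:

    {"name": "alice", "age": 31, "tags": ["admin", "ops"]}
    {"name": "bob",   "age": 27, "tags": []}

The field names, the nesting, and the fact that age is a number are all
schema, just smeared across every record and across whatever code parses
them. Supply the schema separately and the data itself need carry none of
this.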

So back to the schema language. I suggest sticking to .proto files and
having protobuf as the initial "standard" data format. Drill would
initially become a query system for protobuf data. The schema would be
expressed in .proto format, DrQL queries would be validated against it, and
a .proto schema would be supplied for each result. The same would hold for
internal data interchange. A .proto schema would initially have two
encodings: the usual binary hierarchical one and the "Dremel" binary
columnar one. With either encoding it is exactly the same schema, and data
can be converted between encodings without loss.
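
As a rough sketch (a hypothetical schema, not a proposal for Drill's actual
interchange types), the same record shape expressed once in a .proto file:

    message User {
      required string name = 1;
      optional int32  age  = 2;
      repeated string tags = 3;  // repeated/optional fields are what the
    }                            // Dremel columnar encoding stripes out

The standard protobuf wire format gives the binary hierarchical encoding;
the Dremel encoding stores each field's values in a separate column stream.
The two can round-trip precisely because they share this one schema.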

If not protobuf, then we have several other formats that support the
concept of a separate schema - like Avro and Thrift, or oldies like XML and
ASN.1. I am more familiar with protobuf and Avro, and of these two I
strongly favor protobuf (OpenDremel uses Avro, and while it worked great,
the schema language IS WAY TOO CRYPTIC, and this is the only reason I
disfavor Avro). I don't know Thrift, and XML and ASN.1 are so uncool now :)
that no one would bother. So from what I know, I strongly suggest protobuf.
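
For a taste of what I mean, here is the User record from above in Avro's
JSON schema language (again, just an illustration):

    {"type": "record", "name": "User", "fields": [
      {"name": "name", "type": "string"},
      {"name": "age",  "type": ["null", "int"], "default": null},
      {"name": "tags", "type": {"type": "array", "items": "string"}}
    ]}

It says the same thing as the four-line .proto, but optionality becomes a
union with null, and every nested type becomes another layer of JSON.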

As I said, it is not a life-or-death question, just a question of which
format we start coding from... and therefore team experience does count.

What do you think?

Re: schema language for Drill

Posted by Ted Dunning <te...@gmail.com>.
I am currently using the SSA parser that I just committed to build a
simple interpreter.  That interpreter will handle JSON first, but will have
hooks for passing a schema alongside the data.  Essentially, the only
difference between protobufs or columnar data and JSON is that in the
former two cases information derived from the schema will be cacheable,
while in the JSON case it will not be.  Starting with JSON is easy since I
don't have to worry about getting the caching right.
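
To sketch what I mean by those hooks (made-up names; none of this is the
committed parser API):

    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ConcurrentMap;

    interface Schema {
      String fingerprint();              // stable key for caching
    }

    interface FieldReader {              // schema-derived, built once
      Object read(byte[] record, int field);
    }

    class SchemaCache {
      private final ConcurrentMap<String, FieldReader> cache =
          new ConcurrentHashMap<String, FieldReader>();

      // Protobuf or columnar sources hit this once per schema. A raw JSON
      // source has no schema to key on, so every value gets inspected as
      // it is read instead.
      FieldReader readerFor(Schema s) {
        FieldReader r = cache.get(s.fingerprint());
        if (r == null) {
          r = compile(s);                // the expensive part, done once
          FieldReader prev = cache.putIfAbsent(s.fingerprint(), r);
          if (prev != null) {
            r = prev;                    // another thread won the race
          }
        }
        return r;
      }

      private FieldReader compile(Schema s) {
        return new FieldReader() {       // stub standing in for real codegen
          public Object read(byte[] record, int field) {
            return null;
          }
        };
      }
    }

The JSON path simply bypasses the cache, which is why it is the easy place
to start.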

Re: schema language for Drill

Posted by Jason Frantz <jf...@maprtech.com>.
I agree with Camuel that, compared to querying JSON, querying a columnar
Dremel-like format will be significantly faster. Also, a lot of
"schemaless" data has an implicit schema, so supplying the schema out of
band can reduce the processing overhead (this looks like what BigQuery
recently did to handle JSON).

That said, I think there are two big benefits of starting out by tackling
JSON. First, JSON is the easiest to integrate with existing data sets, and
there's no storage format to convert to. Second, I think JSON exercises a
wider set of issues than data with a well-defined schema, such that it
would be much harder to adapt a protobuf-based system to handle JSON later.
For example, using something like LLVM to compile JSON processing code is a
fairly poor fit, since processing every value needs a large switch to
handle all the potential types.
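
To illustrate that switch (hypothetical types, not actual Drill code):

    enum JsonType { NULL, BOOL, NUMBER, STRING, ARRAY, OBJECT }

    class JsonValues {
      // Every JSON value carries its own type tag, so any generated code
      // must branch on the tag at each access.
      static double asDouble(JsonType tag, Object value) {
        switch (tag) {
          case NUMBER: return ((Number) value).doubleValue();
          case BOOL:   return ((Boolean) value) ? 1.0 : 0.0;
          case NULL:   return 0.0;
          default:     throw new IllegalArgumentException("not numeric: " + tag);
        }
      }
    }

With a .proto schema, the type of each field is known before any record is
read, and the whole switch collapses into a single direct load.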

If other people would prefer to go down the schema route first, I agree
with Camuel about starting with protobuf and adopting two formats: a binary
row-based format and a binary Dremel-like columnar format.
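
For anyone unfamiliar with the columnar side, here is a tiny worked example
of Dremel-style striping (a toy schema, made up for illustration):

    message Rec {
      optional string country = 1;
      repeated string tags    = 2;
    }

    record 1:  { country: "US", tags: ["a", "b"] }
    record 2:  { tags: [] }

    column country:  ("US", r=0, d=1)  (NULL, r=0, d=0)
    column tags:     ("a",  r=0, d=1)  ("b", r=1, d=1)  (NULL, r=0, d=0)

The repetition level r says which repeated field the value continues (0
means a new record), and the definition level d says how many optional or
repeated fields along the path are actually present; together they let the
nested records be reassembled losslessly from flat column streams.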

-Jason
