Posted to dev@drill.apache.org by weijie tong <to...@gmail.com> on 2018/08/16 00:01:52 UTC

[DISCUSSION] Does schema-free really need

Hi all:
  I hope this statement does not seem too rash to you.
  Drill claims to be a schema-free distributed SQL engine. It takes a lot
of work to make the execution engine support schema-free storage formats
such as JSON files. That makes it easier to introduce bugs and leaves the
code logic ugly. I wonder whether we should still insist on this, since we
are designing the metadata system in DRILL-6552.
  Traditionally, people design their table schemas first, before firing a
SQL query. I don't think schema-free saves people much time, and other
systems like Spark are popular despite requiring schema declarations. I
think we should be brave enough to make the right decision about whether
to keep this feature, which seems not so important but is a burden.
    Thanks.

Re: [DISCUSSION] Does schema-free really need

Posted by Paul Rogers <pa...@yahoo.com.INVALID>.
Join the club. We all forget that from time to time - until a perfectly simple change blows up in some huge QA test...

- Paul

Sent from my iPhone

> On Aug 22, 2018, at 8:33 AM, Chris Cunningham <cu...@gmail.com> wrote:
> <snip>


Re: [DISCUSSION] Does schema-free really need

Posted by Chris Cunningham <cu...@gmail.com>.
Hmm.  Right. Somehow I forgot about the very large result sets - stupid of
me.

On Tue, Aug 21, 2018 at 4:43 PM Paul Rogers <pa...@yahoo.com.invalid>
wrote:

> Hi Chris,
>
<snip>

> Later, when Drill sees the first Varchar, it can change the type from,
> say, batch 3 onwards. But, JDBC and ODBC generally require the schema be
> known up front, so they would have to predict the future to know that a
> Varchar will eventually appear.


> Within Drill, suppose that the query includes a sort, and that memory
> limits require spilling. The first two batches with just null will be
> spilled to disk in one format. Third and later batches have a different
> schema. So, the code that handles spilling must merge the two types. Not a
> big problem, but the fix must be applied in multiple places in different
> ways. Very difficult to test all the resulting combinations and
> permutations.
>

IF Drill is doing a sort, it is likely that Drill would know the types
before any rows were returned to JDBC/ODBC, so in this case delaying
telling the client what types of columns they are getting could work, right?
Of course, many queries will avoid sorts, so this isn't really an answer.
Maybe a partial one.

Does this make sense? Are we overlooking an alternative solution?
>

It does make sense.

And I can't see a reasonable alternative.  The closest I can come is telling
the client that it is a BLOB of unknown size (or making it roughly 'big').
Then either NULL or a pointer to the data is returned - but this just
pushes the actual determination of the type to the user, with much less
helpful context than Drill would have for finding out what it really is.
I.e., not really good.

IF JDBC/ODBC were enhanced with a new feature that allowed the server to say,
in essence, 'Hey, I know I told you what the columns were, but really, this
column has changed to this type' in the middle of the results, that would
be nice.

ODBC does allow the user to unbind columns and rebind them - so it would be
conceivably possible that Drill could raise a warning, the client could note
that the warning said something like 'Column 17 changed to varchar(2000)',
and then the client unbinds column 17 and rebinds it to a new buffer that
would hold the actual content.
(
https://docs.microsoft.com/en-us/sql/odbc/reference/syntax/sqlbindcol-function?view=sql-server-2017
)

Of course, this would require users to custom-code ODBC access to Drill to
take advantage of this - which I suspect would be pretty uncommon.  I also
see no reference to this ability in JDBC.
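
The closest JDBC analog I can find is the SQLWarning channel. The sketch
below is purely hypothetical - Drill emits no such warning, and the message
text is invented - but it shows roughly what a cooperating client could do:

    import java.sql.*;

    // Purely hypothetical: Drill emits no such warning today, and the
    // warning text is invented. This is roughly the JDBC analog of the
    // ODBC unbind/rebind trick via SQLBindCol.
    public class TypeChangeWatcher {
        static void consume(ResultSet rs) throws SQLException {
            while (rs.next()) {
                SQLWarning w = rs.getWarnings();
                if (w != null && w.getMessage().contains("changed to varchar")) {
                    // Re-read metadata and switch buffers for the column here.
                    rs.clearWarnings();
                }
            }
        }
    }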

Thanks,
> - Paul
>

Thank you for the quick, detailed response!
-Chris

Re: [DISCUSSION] Does schema-free really need

Posted by Paul Rogers <pa...@yahoo.com.INVALID>.
Hi Chris,

Great suggestion. As it turns out, however, when we've looked at this in the past, it simply shifts the problem from one place to another. I'll walk through the issues because it is helpful for us all to understand what we're up against.

Consider a single file with 100K nulls, followed by 150K Varchars. Drill works in batches and will send a series of, say, 5 batches to the client (each with 50K records). The client will start by seeing 2 batches of nulls. The app using JDBC or ODBC needs to know the type in order to allocate memory for the data. But, what type?

Later, when Drill sees the first Varchar, it can change the type from, say, batch 3 onwards. But, JDBC and ODBC generally require the schema be known up front, so they would have to predict the future to know that a Varchar will eventually appear.
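
To make that constraint concrete, here is a minimal JDBC sketch (the
connection URL and query are placeholders): the driver must answer the
metadata call before the first row is fetched.

    import java.sql.*;

    public class SchemaUpFront {
        public static void main(String[] args) throws SQLException {
            // Placeholder connection string; the point is the call order below.
            try (Connection conn = DriverManager.getConnection("jdbc:drill:drillbit=localhost");
                 Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery("SELECT x FROM `foo.json`")) {

                // The driver must report a type here, before any row is read.
                // If the first batches hold only nulls, what can it say?
                ResultSetMetaData md = rs.getMetaData();
                int declaredType = md.getColumnType(1); // a java.sql.Types constant

                // Every row fetched later is expected to match that type.
                while (rs.next()) {
                    Object value = rs.getObject(1);
                }
            }
        }
    }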

Drill could buffer the result set, waiting for the schema change. Once it occurs, Drill could rewrite the prior data to the newly discovered type. But, since Drill has no idea when (or if) a schema change will occur, and data could be unlimited in size, Drill would have to buffer the entire result set, impacting latency since, in the general case, the entire result set would be spilled to disk, then read back, before the first records are sent to the client.

JDBC, at least, allows the idea of multiple result sets (used for stored procedures.) Drill could send a first result set with NULL as the type, then a second when the type changed to Varchar. This then pushes the schema change to the app using JDBC, but most are not set up to handle multiple result sets. (Indeed, what would the app do if it requires tabular data with consistent types?)
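
For reference, the client-side pattern for multiple result sets already
exists in standard JDBC. A sketch of what an app would have to do
(hypothetical for Drill, which does not return multiple result sets today):

    import java.sql.*;

    // Hypothetical sketch: Drill does not return multiple result sets today.
    // This is the standard JDBC pattern a client app would need to adopt.
    public class MultiResultSets {
        static void consumeAll(Statement stmt, String sql) throws SQLException {
            boolean isResultSet = stmt.execute(sql);
            while (true) {
                if (isResultSet) {
                    try (ResultSet rs = stmt.getResultSet()) {
                        // Each result set may declare a different type for the
                        // same column: NULL at first, Varchar once data appears.
                        while (rs.next()) { /* consume rows */ }
                    }
                } else if (stmt.getUpdateCount() == -1) {
                    break; // no more results of any kind
                }
                isResultSet = stmt.getMoreResults();
            }
        }
    }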

Within Drill, suppose that the query includes a sort, and that memory limits require spilling. The first two batches with just null will be spilled to disk in one format. Third and later batches have a different schema. So, the code that handles spilling must merge the two types. Not a big problem, but the fix must be applied in multiple places in different ways. Very difficult to test all the resulting combinations and permutations.
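
The merge rule itself is not hard; the pain is that something like the
following sketch (illustrative names, not Drill's actual internals) must be
repeated in the sort spill, the shuffle receivers, and every other operator
that touches batches:

    import java.sql.Types;

    // Illustrative only: the type-promotion rule each operator would need.
    final class TypePromotion {
        static int merge(int typeA, int typeB) {
            if (typeA == typeB) return typeA;
            // A column seen only as NULL adopts the other side's type.
            if (typeA == Types.NULL) return typeB;
            if (typeB == Types.NULL) return typeA;
            // Otherwise the merge is ambiguous: Int vs Varchar, etc.
            throw new IllegalStateException(
                "Cannot reconcile schema change: " + typeA + " vs " + typeB);
        }
    }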

Suppose that the file in question is a CSV file that can be split, and we have, say, five concurrent readers, each reading 50K records. Two readers read the nulls and shuffle them to other nodes. Three readers read the Varchars and shuffle the data to other nodes. The receiving nodes see, effectively, a random arrival of data: some start with Varchar followed by null, some the other way around. Again, we could write code to merge these types in each receiver.

But, if we scale up the file, we may find that some receivers go many batches of nulls before they see a Varchar, which pushes the problem up the DAG, eventually to the client, as described above.

We seem to be forced into the conclusion that the only solution is to predict the future (so we know to use the Varchar in the reader, even when we see nulls from the file). The only way to do that (without access to time travel) is for metadata to tell us that, "data for this column, when it eventually appears, will be Varchar."

Does this make sense? Are we overlooking an alternative solution?

Thanks,
- Paul

 

    On Tuesday, August 21, 2018, 3:45:55 PM PDT, Chris Cunningham <cu...@gmail.com> wrote:  
 
<snip>

Re: [DISCUSSION] Does schema-free really need

Posted by Chris Cunningham <cu...@gmail.com>.
Hi.  Mostly off topic, but reading about this issue has finally prompted a
response.

On Wed, Aug 15, 2018 at 5:46 PM Paul Rogers <pa...@yahoo.com.invalid>
wrote:
<snip>

> If we provide schema hints ("field x, when it appears, will be a Double"),
> then Drill need not fight with the problem that the data will be Double in
> some files, Nullable Int in others. The schema hint removes the ambiguity.
>
</snip>

How hard would it be to have Drill 'fix' the nullable Int problem by just
noting that a column is nullable, with no other type information assumed,
until it encounters an actual type, and then requiring that type from then
on?  If an entire file has only nulls, then it doesn't define the field at
all - only files that have the type define the field - and only when they
are combined later, IF there are conflicting non-null types, would an error
be thrown?
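
In plain Java, the rule I have in mind would look something like this
sketch (illustrative only, not Drill code):

    import java.sql.Types;

    // Illustrative sketch of the proposal: a column carries no type until
    // a concrete value appears, and the first real type then locks in.
    final class LazyTypedColumn {
        private Integer type = null; // null means "nullable, type unknown"

        void observe(int jdbcType) {
            if (jdbcType == Types.NULL) return;            // nulls teach us nothing
            if (type == null) { type = jdbcType; return; } // first real value wins
            if (type != jdbcType)                          // a later conflict is an error
                throw new IllegalStateException(
                    "Conflicting non-null types: " + type + " vs " + jdbcType);
        }
    }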

Thanks,
Chris

Re: [DISCUSSION] Does schema-free really need

Posted by Paul Rogers <pa...@yahoo.com.INVALID>.
Hi Weijie,

Thanks for raising this topic. I think you've got a great suggestion.

My two cents: there is no harm in reading all manner of ugly data. But, rather than try to process the mess throughout Drill (as we do today with schema changes, just-in-time code generation, union vectors and the rest), simply require that the user provide disambiguation rules prior to running the query. Require that the reader convert input data into a clean relational format.

That is, rather than try to be schema-less, just be schema-on-read, which, as you point out, is what most users probably want anyway.


JSON is my favorite example of the current issues. JSON is a general-purpose data format; SQL is a relational language. IMHO, if JDBC and ODBC are Drill's primary APIs, data must fit those APIs, and that means data must be relational. So, the goal when using JSON is to map the JSON structure into a relational structure.

Since Drill now supports arrays of nested tuples via implicit joins, "relational format" here means the extended relational format. Perhaps a requirement is that arrays MUST be flattened to simple tuples prior to returning results to JDBC or ODBC. 


If we provide schema hints ("field x, when it appears, will be a Double"), then Drill need not fight with the problem that the data will be Double in some files, Nullable Int in others. The schema hint removes the ambiguity.
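
As a sketch of what such a hint lookup might look like (the hint format and
API here are assumptions, not an existing Drill feature):

    import java.sql.Types;
    import java.util.Map;

    // Assumed hint format and API, for illustration only.
    final class SchemaHints {
        private final Map<String, Integer> hints; // column name -> JDBC type

        SchemaHints(Map<String, Integer> hints) { this.hints = hints; }

        // With a hint, a column full of nulls still gets a definite type;
        // without one, the reader is back to guessing Nullable Int.
        int resolve(String column, int observedType) {
            Integer hinted = hints.get(column);
            if (hinted != null) return hinted;
            return observedType == Types.NULL ? Types.INTEGER : observedType;
        }
    }

With a hint of Types.DOUBLE for x, resolve("x", Types.NULL) yields a
definite Double even when the file so far holds only nulls.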

The result is that Drill can read any type of data. But, as you say,  the Drill internals are simpler, cleaner and faster. There is no ambiguity about types that Drill kinda-sorta supports but that ODBC/JDBC don't support. Effort can go into new features rather than fighting an unwinnable battle to use non-relational data in a relational engine.


In short, as part of the metadata API work, perhaps define how metadata can simplify Drill internals. Ensure that users can create simple metadata hint files for ad-hoc use, maybe as an extension to view files. Then, push the problem of messy schemas into a conversion layer in the reader and out of the rest of the Drill execution engine.

The reason I keep jumping into these metadata discussions is that I encountered the ambiguity problems first hand in the "result set loader" work. There are ambiguities in JSON that simply cannot be resolved until Drill can predict the future (by having access to metadata.)

Consider the query "SELECT a, x FROM foo.json". Column x is missing from the first 1000 records. In the 1001st record, it shows up and is a Double. Previously, Drill would guess Nullable Int for the column, then blow up when x appeared as a Double. A revision was to postpone picking a type as late as possible so that, if column x does show up in the first batch, the ambiguity is avoided. But this trick does not work if the column shows up in the second or later batch: we need the column in the first batch, so we guess Nullable Int. Depending on the query, this results in a schema change elsewhere in the DAG or in the client.

Similarly, in the new version, JSON can handle data of the form {x: 10.1} {x: 10} because it can convert an Int to a Double. Sadly, {x: 10} {x: 10.1} still fails because Drill can't convert a Double to an Int. Sigh...
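
The asymmetry is the same one plain Java has with numeric conversions
(plain Java below, not Drill code):

    public class WideningDemo {
        public static void main(String[] args) {
            // {x: 10.1} then {x: 10}: the column is already Double, and a
            // later integer value widens losslessly.
            double widened = 10;           // int -> double is always safe
            // {x: 10} then {x: 10.1}: the column is already Int, and the
            // only way to store 10.1 there is a lossy narrowing cast.
            int narrowed = (int) 10.1;     // silently drops the .1
            System.out.println(widened + " / " + narrowed); // 10.0 / 10
        }
    }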


To implement these tricks, the revised reader framework accepts a schema and automagically does the work of converting data from the input type to the defined type. Once this code is in master (we are still about three PRs away), it can be combined with the metadata system to achieve the "apply schema on read" idea discussed above. We then would not need the horrible hacks just discussed.


Thanks,
- Paul

 

    On Wednesday, August 15, 2018, 5:02:08 PM PDT, weijie tong <to...@gmail.com> wrote:  
 
<snip>

Re: [DISCUSSION] Does schema-free really need

Posted by Charles Givre <cg...@gmail.com>.
I like where Ted is going with this.  I do also like Paul’s idea of giving Drill hints as to the schema.  
— C

> On Aug 15, 2018, at 20:34, Ted Dunning <te...@gmail.com> wrote:
> <snip>


Re: [DISCUSSION] Does schema-free really need

Posted by Paul Rogers <pa...@yahoo.com.INVALID>.
Thanks for the background Jinfeng, your explanation brings us back to the topic Arina raised: the state and direction of the Drill project.

For several years now, Drill has said, essentially, "dabble with raw data all you like early in the project, but for production, ETL your data into Parquet." One view is that we double-down on this idea. Parquet has many advantages: it carries its own schema. As Jinfeng noted, it is the product of an ETL process that cleans up and normalizes data, removing the variations that creep in during schema evolution.

Further, Drill already has Ted's mechanism to pre-scan the data: Drill does it to capture the Parquet footer and directory metadata. There is a not-yet-committed project to gather stats on Parquet files. Parquet allows Drill to be schema-free (yes, a marketing term, really schema-on-read) with a good, solid schema as defined in Parquet. Of course, even Parquet is subject to ambiguities due to newly added columns, but ETL should clean up such issues.

Extend that to other types: "schema-free" could mean, "Drill does not do schemas; each data source must provide its own CONSISTENT schema." Parquet does this today as does Hive. Allow external hints for CSV and JSON. Require that CSV and JSON have clean, extended relational schemas to avoid messy ambiguities and schema changes.

An argument can be made that many fine ETL tools exist: NiFi, Spark, MR, Hive, StreamSets, ... Drill need not try to solve this problem. Further, those other products generally allow the user to use code to handle tough cases; Drill only has SQL, and often SQL is simply not sufficiently expressive. Here is a favorite example:

{fields: [ {id: 1, name: "customer-name", type: "string", value: "fred"}, {id: 2, name: "balance", type: "float", value: "123.45"}] }

It is unlikely that Drill can add enough UDFs to parse the above, and to do so efficiently. But, parsing the above in code (in Spark, say) is easy.
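
For contrast, here is roughly what the code-based version looks like with
Jackson (assumed on the classpath); a sketch, but it shows how little code
is needed:

    import com.fasterxml.jackson.databind.JsonNode;
    import com.fasterxml.jackson.databind.ObjectMapper;

    // A sketch of why this is "easy in code": a few lines of Jackson turn
    // the name/type/value records into typed columns.
    public class FieldArrayParser {
        public static void main(String[] args) throws Exception {
            String json = "{\"fields\": ["
                + "{\"id\": 1, \"name\": \"customer-name\", \"type\": \"string\", \"value\": \"fred\"},"
                + "{\"id\": 2, \"name\": \"balance\", \"type\": \"float\", \"value\": \"123.45\"}]}";
            JsonNode root = new ObjectMapper().readTree(json);
            for (JsonNode field : root.get("fields")) {
                String name = field.get("name").asText();
                Object value = "float".equals(field.get("type").asText())
                    ? Double.valueOf(field.get("value").asText())
                    : field.get("value").asText();
                System.out.println(name + " = " + value);
            }
        }
    }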

So, perhaps Drill becomes a Parquet-focused query engine, requiring unambiguous schema, defined by the data source itself (which includes, for Parquet, Drill's metadata and stats files.) Drill is "schema-free", but the data is required to provide a clean, unambiguous schema.

The only problem is that this big data niche is already occupied by an entrenched leader called Impala. Impala is faster than Drill (even if it is hard to build and very hard to tune.) Drill can't win by saying, "we're almost as fast, have fewer bugs, and are easier to build." Spark didn't win by being almost as good as MR; it won because it was far better on many dimensions. Marketing 101 says that a product can never win a head-to-head battle with an entrenched leader.

So, what niche can Drill fill that is not already occupied? Spark is great for complex data transforms of the kind shown above, but is not good at all for interactive SQL. (Spark is not a persistent engine, it is a batch system like MR. As noted in an earlier message, Spark shuffles data in stages, not in batches like Drill and Impala.) But, Spark has a huge community; maybe someone will solve these issues. So, Drill as a better Spark is not appealing.

One idea that has come up from time-to-time is Drill as an open source Splunk-like tool. Splunk can ingest zillions of file formats using adapters, akin to Drill's storage and format plugins. Ingesting arbitrary files requires some schema cleansing on read. Drill's answer could be to add ad-hoc metadata, allow data-cleaning plugins, and add UDFs. That is, double-down on the idea that Drill does read multiple formats; solve the remaining issues to do so well.

In short, the big question is, "what does Drill want to do now that it's grown up?" Compete with Impala (Parquet only)? Compete with Spark (better code-based query engine)? Compete with Splunk (query any file format)? Something else?

Whatever we do, to Weijie's point, we should do it in a way that is stable: today's approach to handling messy schemas can't ever work completely because it requires that Drill predict the future: a reader must decide on record 1 how to handle a field that won't actually appear until file (or block) 100. Do we need that? How do we maintain code (union vectors, list vectors, schema change) that never worked and probably never can? What is the better solution?

Thanks,
- Paul

 

    On Wednesday, August 15, 2018, 11:15:10 PM PDT, Jinfeng Ni <jn...@apache.org> wrote:  
 
<snip>

Re: [DISCUSSION] Does schema-free really need

Posted by weijie tong <to...@gmail.com>.
I think there's no truly schema-free data. For any ad-hoc query over a given
file, the schema is already defined; today it is simply discovered by Drill
rather than defined explicitly by the user.

On Thu, Aug 16, 2018 at 2:29 PM Jinfeng Ni <jn...@apache.org> wrote:

> <snip>

Re: [DISCUSSION] Does schema-free really need

Posted by Jinfeng Ni <jn...@apache.org>.
btw: In one project that I'm currently working on (an application related to
IoT), I'm leveraging Drill's schema-on-read ability, without requiring users
to predefine table DDL.


On Wed, Aug 15, 2018 at 11:15 PM, Jinfeng Ni <jn...@apache.org> wrote:

> <snip>

Re: [DISCUSSION] Does schema-free really need

Posted by Jinfeng Ni <jn...@apache.org>.
The use case Weijie described seems to fall into the category of the
traditional data warehouse, i.e., schemas are predefined by users, and data
strictly conforms to the schema. Certainly this is one important use case,
and I agree that the schema-on-read logic in the Drill run-time is indeed a
disadvantage for such a use case, compared with other SQL query engines like
Impala/Presto.

The question we want to ask is whether that's the only use case Drill wants
to target. We probably want to hear more cases from the Drill community
before we can decide on the best strategy going forward.

In the examples Paul listed, why would two sets of data have different
schemas? In many cases, it's because the application generating the data
changed: a field was added, deleted, or modified. ETL is the typical
approach to cleaning up such data. Drill's argument, a couple of years ago
when the project was started, was that ETL is too time-consuming, and that
a query engine that could query directly against such datasets would
provide great value.

I feel Paul's suggestion of letting the user provide a schema, or having
Drill scan/probe and learn the schema, falls in the middle of the spectrum:
ETL is one extreme, and Drill's current schema-on-read is the other.
Personally, I would prefer letting Drill scan/probe the schema, as it might
not be easy for users to provide a schema for nested data (would they have
to provide type information for every nested field?).

To Weijie's comment about the complexity of the schema-handling code: in
theory we should refactor/rewrite the majority of the run-time operators,
separating the logic for handling schema from the logic for handling the
regular data flow. That would clean up the current mess.

ps1: IMHO, schema-less is purely a PR term. The more appropriate term for
Drill would be schema-on-read.
  2: I would not call it a battle between non-relational data and a
relational engine. The extended relational model has array/composite types,
similar to what Drill has.





On Wed, Aug 15, 2018 at 7:27 PM, weijie tong <to...@gmail.com>
wrote:

> <snip>

Re: [DISCUSSION] Does schema-free really need

Posted by weijie tong <to...@gmail.com>.
@Paul I really appreciate the statement "Effort can go into new features
rather than fighting an unwinnable battle to use non-relational data in a
relational engine."

At Ant Financial (known for Alipay, an Alibaba-affiliated company) we now
use Drill to support most of our analysis work. Our business and data are
complex enough. Our strategy is to let users design their schema first,
then dump in their data, and query it later. This workflow runs smoothly.
But digging deep into Drill's code internals and the JIRA bugs, we see a
lot of non-intuitive code that handles schema changes but really does not
help most actual use cases. I think this also makes the storage plugin
interface less intuitive to implement.

We are spending a great deal of work for little gain. Users really don't
care about defining a schema first, but they do care whether their queries
are fast enough. Probing the data to guess the schema and caching it is,
to me, a compromise strategy, but still not clean enough. So I hope we move
the messy schema-resolution logic out of Drill and make the code cleaner by
defining the schema first with DDL statements. If we agree on this, the
work should be a sub-task of DRILL-6552.

On Thu, Aug 16, 2018 at 8:51 AM Paul Rogers <pa...@yahoo.com.invalid>
wrote:

> <snip>

Re: [DISCUSSION] Does schema-free really need

Posted by Paul Rogers <pa...@yahoo.com.INVALID>.
Hi Ted,

I like the "schema auto-detect" idea.

As we discussed in a prior thread, caching of schemas is a nice add-on once we have defined the schema-on-read mechanism. Maybe we first get it to work with a user-provided schema. Then, as an enhancement, we offer to infer the schema by scanning data.

There are some ambiguities that schema inference can't resolve: in {x: "1002"} {x: 1003}, should x be an Int or a Varchar?

Still if Drill could provide a guess at the schema, and the user could refine it, we'd have a very elegant solution.
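
A sketch of that infer-then-refine flow (illustrative only, not Drill code):
scan sample values, record a guess per column, and surface conflicts for
the user instead of silently picking one:

    import java.util.HashMap;
    import java.util.Map;

    // Illustrative only: infer a schema by scanning, flag what we cannot decide.
    final class SchemaGuesser {
        private final Map<String, String> guesses = new HashMap<>();

        void observe(String column, Object value) {
            String t = (value instanceof Number) ? "INT"
                     : (value instanceof String) ? "VARCHAR"
                     : "UNKNOWN";
            String prior = guesses.putIfAbsent(column, t);
            if (prior != null && !prior.equals(t)) {
                // {x: "1002"} vs {x: 1003}: record the conflict; let the user refine.
                guesses.put(column, prior + "|" + t);
            }
        }

        Map<String, String> result() { return guesses; }
    }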


Thanks,
- Paul

 

    On Wednesday, August 15, 2018, 5:35:06 PM PDT, Ted Dunning <te...@gmail.com> wrote:  
 
<snip>

Re: [DISCUSSION] Does schema-free really need

Posted by Ted Dunning <te...@gmail.com>.
This is a bold statement.

And there are variants of it that could give users nearly the same
experience that we have now. For instance, we could cache discovered
schemas for old files, and discover (and cache) the schema for any new file
that we see before actually running a query. That gives us pretty much the
flexibility of schema-on-read without as much of the burden.
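
A sketch of what such a cache might look like (names illustrative): keyed
by path plus modification time, so only new or changed files pay the
discovery cost:

    import java.io.File;
    import java.util.concurrent.ConcurrentHashMap;

    // Illustrative sketch of the discovered-schema cache.
    final class SchemaCache<S> {
        interface Discoverer<T> { T discover(File f); }

        private final ConcurrentHashMap<String, S> cache = new ConcurrentHashMap<>();

        S schemaFor(File f, Discoverer<S> d) {
            // Unchanged files hit the cache; new or modified files get scanned.
            String key = f.getPath() + "@" + f.lastModified();
            return cache.computeIfAbsent(key, k -> d.discover(f));
        }
    }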



On Wed, Aug 15, 2018 at 5:02 PM weijie tong <to...@gmail.com> wrote:

> <snip>
>