Posted to issues@drill.apache.org by "Paul Rogers (JIRA)" <ji...@apache.org> on 2018/11/07 02:49:00 UTC
[jira] [Comment Edited] (DRILL-6829) Handle schema change in ExternalSort

    [ https://issues.apache.org/jira/browse/DRILL-6829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16677572#comment-16677572 ] 

Paul Rogers edited comment on DRILL-6829 at 11/7/18 2:48 AM:
-------------------------------------------------------------

[~amansinha100], thanks for the explanation. A couple of observations. First, Drill is a relational engine, and its clients are often JDBC or ODBC. Such clients cannot handle a schema change. (Of course, the Drill client is more flexible, so it certainly can handle schema changes.)

Second, the union type has never really worked. There is no support for it in JDBC or ODBC. So, it would be a "Drill-client-only" solution. That may or may not be bad depending on Drill's target user base.

There is now overwhelming evidence that, for non-Mongo data sources, there is no way to achieve a reliable schema incrementally when data is delivered in random order.

So, maybe divide the problem into two parts: the schema mechanism for those users that use xDBC, and something clever like what is suggested here for those users of Mongo that use the Drill client and can absorb varying schemas. (Other DBs have this same property, including MapR DB JSON IIRC.)

My experience is with the uses and users of xDBC and similar interfaces. I don't know of any users of the raw Drill client, but I suppose they could exist...

In any event, rather than debate the topic to death, just go ahead and work out what happens when there are many files, scanned on many nodes, in random order, with each supported kind of schema change. It is very hard for any relational engine to make sense of the data when the schema changes randomly across runs (because of the random scan order). Work through those cases in detail and you'll go into this with your eyes wide open about what can actually be done in practice.
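
To make the "work through those cases" exercise concrete, here is a minimal sketch in Python. It is not Drill code: the file names, column names, and type labels are made up for illustration. It simply enumerates every scan order over a handful of files and records what a downstream operator would observe: the schema of the first batch, and how many schema changes follow.

```python
from itertools import permutations

# Hypothetical per-file schemas (column name -> type label). Not taken from Drill.
BATCHES = {
    "file_a": {"id": "INT", "amount": "INT"},
    "file_b": {"id": "INT", "amount": "FLOAT"},
    "file_c": {"id": "INT", "amount": "VARCHAR", "note": "VARCHAR"},
    "file_d": {"id": "INT", "amount": "INT"},  # same schema as file_a
}

def observe(order):
    """Return (schema of the first batch, number of schema changes) for one scan order."""
    first = BATCHES[order[0]]
    changes = sum(
        1 for prev, cur in zip(order, order[1:]) if BATCHES[prev] != BATCHES[cur]
    )
    return first, changes

def distinct_initial_schemas():
    """Collect every schema that could be the first one seen, over all scan orders."""
    seen = set()
    for order in permutations(BATCHES):
        first, _ = observe(order)
        seen.add(tuple(sorted(first.items())))
    return seen

if __name__ == "__main__":
    for order in permutations(BATCHES):
        first, changes = observe(order)
        print(order, first, changes)
```

Even in this toy setup, both the initial schema and the number of change events depend on the scan order, so two runs of the same query over the same files can present different schemas to a blocking operator.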

It may also be worth pointing out: even MongoDB users will appreciate a schema if they have wild and crazy data types, but must deliver consistent schema results to JDBC or ODBC. So, even if the proposal here can be made to work for the Drill client, there is even more value in making it work for Tableau (and similar) users.
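
As a rough illustration of why a declared schema helps such clients, the sketch below coerces each record to a fixed set of columns and types before it is returned. This is an assumption-laden toy, not Drill's schema mechanism: the `DECLARED` mapping, the `conform` helper, and the sample rows are all hypothetical, and Python types stand in for SQL types.

```python
# Hypothetical declared schema: column name -> Python type standing in for a SQL type.
DECLARED = {"id": int, "amount": float, "note": str}

def conform(row, schema=DECLARED):
    """Coerce one record to the declared schema.

    Columns missing from the record become None (standing in for NULL);
    columns not named in the schema are dropped.
    """
    out = {}
    for col, typ in schema.items():
        value = row.get(col)
        out[col] = None if value is None else typ(value)
    return out

# Sample records with varying shapes, as might arrive from different files.
rows = [
    {"id": 1, "amount": 10},
    {"id": 2, "amount": "12.5", "note": "late"},
    {"id": 3, "amount": 7.25},
]

if __name__ == "__main__":
    for row in rows:
        print(conform(row))
```

Because every output record has the same columns and types regardless of which input arrives first, the result shape no longer depends on scan order, which is the property a JDBC or ODBC client needs.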


> Handle schema change in ExternalSort
> ------------------------------------
>
>                 Key: DRILL-6829
>                 URL: https://issues.apache.org/jira/browse/DRILL-6829
>             Project: Apache Drill
>          Issue Type: New Feature
>            Reporter: Aman Sinha
>            Priority: Major
>
> While we continue to enhance the schema provision and metastore aspects in Drill, we also should explore what it means to be truly schema-less such that we can better handle {semi, un}structured data, data sitting in DBs that store JSON documents (e.g., Mongo, MapR-DB).
>  
> The blocking operators are the main hurdles in this goal (other operators also need to be smarter about this but the problem is harder for the blocking operators).   This Jira is specifically about ExternalSort. 
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)