Posted to dev@drill.apache.org by Julian Hyde <ju...@gmail.com> on 2013/11/02 00:51:10 UTC

Re: Schema discovery

A recent blog post by Daniel Abadi has a similar theme:

http://hadapt.com/blog/2013/10/28/all-sql-on-hadoop-solutions-are-missing-the-point-of-hadoop/

We could create a tool that scans the raw files and generates an Optiq schema that contains views that apply "late schema" (the "EMP" and "DEPT" views in https://raw.github.com/apache/incubator-drill/HEAD/sqlparser/src/test/resources/test-models.json are examples of this). The user could interactively modify that schema (e.g. change a column's type from string to boolean or integer).
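
To make the idea concrete, here is a rough, hypothetical sketch (in Java) of the kind of type inference such a tool might run over a sample of records, emitting view SQL that applies the inferred schema on read. The class and method names are invented for illustration; a real tool would feed the generated SQL into an Optiq model file rather than just returning a string.

    import java.util.LinkedHashMap;
    import java.util.List;
    import java.util.Map;

    public class RawFileScanner {

        // Map a sampled Java value onto a SQL type for the generated view.
        static String sqlType(Object value) {
            if (value instanceof Boolean) return "BOOLEAN";
            if (value instanceof Integer || value instanceof Long) return "INTEGER";
            if (value instanceof Double) return "DOUBLE";
            return "VARCHAR(100)";   // fall back to strings for anything unrecognised
        }

        // Build the SQL body of a "late schema" view; a user could later edit a
        // CAST, e.g. change a column from VARCHAR to BOOLEAN or INTEGER.
        static String inferViewSql(String tableName, List<Map<String, Object>> sampledRecords) {
            Map<String, String> columns = new LinkedHashMap<>();
            for (Map<String, Object> record : sampledRecords) {
                for (Map.Entry<String, Object> field : record.entrySet()) {
                    // First guess wins here; a real tool would reconcile conflicts.
                    columns.putIfAbsent(field.getKey(), sqlType(field.getValue()));
                }
            }
            StringBuilder sql = new StringBuilder("SELECT ");
            boolean first = true;
            for (Map.Entry<String, String> column : columns.entrySet()) {
                if (!first) sql.append(", ");
                sql.append("CAST(").append(column.getKey())
                   .append(" AS ").append(column.getValue())
                   .append(") AS ").append(column.getKey());
                first = false;
            }
            sql.append(" FROM ").append(tableName);
            return sql.toString();
        }
    }

Sampling keeps the scan cheap; here the first type seen for a field wins, whereas a real tool would reconcile conflicting guesses (for example by widening INTEGER to VARCHAR).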

It's a nice approach because it doesn't impact the Drill engine. This is good. Metadata and data should be kept separate wherever possible.

Julian

Re: Schema discovery

Posted by Timothy Chen <tn...@gmail.com>.
It does seem interesting; I'll definitely want to look deeper and see how much data store metadata it exposes, as we could potentially leverage that to build our storage engines.

Tim

Sent from my iPhone

> On Nov 2, 2013, at 11:18 PM, Dhruv <yo...@gmail.com> wrote:
> 
> Hi,
>    I think we might be able to reuse the schema discovery features of http://metamodel.incubator.apache.org/
>    Although it is still in the incubator, http://metamodel.eobjects.org/goals.html suggests it is already pretty mature (versioned up to 3.4).
>    One of its most distinguishing features is "Traversing and building the *structure* of datastores", which matches our goal of "late schema" support.
> 
> -Dhruv
>> On 11/02/2013 11:21 AM, Timothy Chen wrote:
>> Hi Julian,
>> 
>> Glad to have someone respond to this :) Yes, I think going beyond just
>> having no schema defined up front to actually giving users possibilities is
>> definitely a much better interactive experience.
>> 
>> I would imagine, though, that it could impact Drill, or perhaps require
>> building more statistics capabilities in Drill to query schema info. Since
>> not all data is just raw files but could be living in different data stores,
>> I would think we need to go through the Drill storage engine abstraction to
>> get that info.
>> 
>> I'll chat about this with Jacques and folks next Monday or in the Drill
>> user group.
>> 
>> Tim
>> 
>> 
>>> On Fri, Nov 1, 2013 at 4:51 PM, Julian Hyde <ju...@gmail.com> wrote:
>>> 
>>> A recent blog post by Daniel Abadi has a similar theme:
>>> 
>>> 
>>> http://hadapt.com/blog/2013/10/28/all-sql-on-hadoop-solutions-are-missing-the-point-of-hadoop/
>>> 
>>> We could create a tool that scans the raw files and generates an Optiq
>>> schema that contains views that apply "late schema" (the "EMP" and "DEPT"
>>> views in
>>> https://raw.github.com/apache/incubator-drill/HEAD/sqlparser/src/test/resources/test-models.json are examples of this). The user could interactively modify that schema
>>> (e.g. change a column's type from string to boolean or integer).
>>> 
>>> It's a nice approach because it doesn't impact the Drill engine. This is
>>> good. Metadata and data should be kept separate wherever possible.
>>> 
>>> Julian
> 

Re: Schema discovery

Posted by Dhruv <yo...@gmail.com>.
Hi,
     I think we might be able to reuse the schema discovery features of 
http://metamodel.incubator.apache.org/
     Although it is still in the incubator, 
http://metamodel.eobjects.org/goals.html suggests it is already pretty 
mature (versioned up to 3.4).
     One of its most distinguishing features is "Traversing and building 
the *structure* of datastores", which matches our goal of "late 
schema" support.

-Dhruv
On 11/02/2013 11:21 AM, Timothy Chen wrote:
> Hi Julian,
>
> Glad to have someone respond to this :) Yes, I think going beyond just
> having no schema defined up front to actually giving users possibilities is
> definitely a much better interactive experience.
>
> I would imagine, though, that it could impact Drill, or perhaps require
> building more statistics capabilities in Drill to query schema info. Since
> not all data is just raw files but could be living in different data stores,
> I would think we need to go through the Drill storage engine abstraction to
> get that info.
>
> I'll chat about this with Jacques and folks next Monday or in the Drill
> user group.
>
> Tim
>
>
> On Fri, Nov 1, 2013 at 4:51 PM, Julian Hyde <ju...@gmail.com> wrote:
>
>> A recent blog post by Daniel Abadi has a similar theme:
>>
>>
>> http://hadapt.com/blog/2013/10/28/all-sql-on-hadoop-solutions-are-missing-the-point-of-hadoop/
>>
>> We could create a tool that scans the raw files and generates an Optiq
>> schema that contains views that apply "late schema" (the "EMP" and "DEPT"
>> views in
>> https://raw.github.com/apache/incubator-drill/HEAD/sqlparser/src/test/resources/test-models.json are examples of this). The user could interactively modify that schema
>> (e.g. change a column's type from string to boolean or integer).
>>
>> It's a nice approach because it doesn't impact the Drill engine. This is
>> good. Metadata and data should be kept separate wherever possible.
>>
>> Julian


Re: Schema discovery

Posted by Timothy Chen <tn...@gmail.com>.
Hi Julian,

Glad to have someone respond to this :) Yes, I think going beyond just
having no schema defined up front to actually giving users possibilities is
definitely a much better interactive experience.

I would imagine, though, that it could impact Drill, or perhaps require
building more statistics capabilities in Drill to query schema info. Since
not all data is just raw files but could be living in different data stores,
I would think we need to go through the Drill storage engine abstraction to
get that info.
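
Purely as a hypothetical sketch (this is not Drill's actual storage engine API, and every name below is invented for illustration), the kind of metadata hook that abstraction might expose could look like:

    import java.util.List;

    // Hypothetical: a hook a storage engine could implement so a
    // schema-discovery tool does not have to read raw data itself.
    public interface DiscoverableStorageEngine {

        // Tables (or collections, files, regions, ...) known to the data store.
        List<String> listTableNames();

        // Best-effort column name/type guesses for one table, possibly sampled.
        List<ColumnGuess> describeTable(String tableName);

        // A discovered column plus the type the engine would guess for it.
        final class ColumnGuess {
            public final String name;
            public final String sqlType;   // e.g. "VARCHAR", "INTEGER", "BOOLEAN"

            public ColumnGuess(String name, String sqlType) {
                this.name = name;
                this.sqlType = sqlType;
            }
        }
    }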

I'll chat about this with Jacques and folks next Monday or in the Drill
user group.

Tim


On Fri, Nov 1, 2013 at 4:51 PM, Julian Hyde <ju...@gmail.com> wrote:

> A recent blog post by Daniel Abadi has a similar theme:
>
>
> http://hadapt.com/blog/2013/10/28/all-sql-on-hadoop-solutions-are-missing-the-point-of-hadoop/
>
> We could create a tool that scans the raw files and generates an Optiq
> schema that contains views that apply "late schema" (the "EMP" and "DEPT"
> views in
> https://raw.github.com/apache/incubator-drill/HEAD/sqlparser/src/test/resources/test-models.json are examples of this). The user could interactively modify that schema
> (e.g. change a column's type from string to boolean or integer).
>
> It's a nice approach because it doesn't impact the Drill engine. This is
> good. Metadata and data should be kept separate wherever possible.
>
> Julian
