Posted to dev@drill.apache.org by weijie tong <to...@gmail.com> on 2018/06/28 15:45:42 UTC

Discussion about the metadata design

Hi all:

As @aman once pointed out to me, the DRILL-2.0 roadmap includes a
description of the metadata design (
https://lists.apache.org/thread.html/74cf48dd78d323535dc942c969e72008884e51f8715f4a20f6f8fb66@%3Cdev.drill.apache.org%3E).
I am interested in taking on the implementation of the metadata part, so I
am starting this discussion thread to hear your ideas on the problem.

I have looked at several open source metadata projects, such as Hive
Metastore (
https://cwiki.apache.org/confluence/display/Hive/Design#Design-Metastore),
Netflix Metacat, Apache Atlas, and LinkedIn WhereHows (
https://github.com/linkedin/WhereHows). Except for Hive Metastore, these
projects define a highly abstract model on top of the actual physical
metadata, which makes it easier to add new metadata properties. Hive
Metastore's design models the physical metadata directly and offers a Thrift
interface for different languages, but it depends on a relational database,
which is not good for scale and performance. In my opinion, we should use
Hive Metastore as our design template, or simply reuse it, since we don't
need a rich metadata management system. Maybe we should change the backend
database to a KV store with high query performance, such as HBase.

Besides the metadata interface design and the choice of backend storage, we
should also provide the ability to run random queries, so that users can
calculate statistics such as NDV (number of distinct values) and store them
in the metadata. By the way, maybe we can go further and bring in VerdictDB
(https://github.com/mozafari/verdictdb) to provide richer approximate query
processing.
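
To make the NDV idea concrete, here is a minimal sketch of computing a per-column distinct-value count that could then be stored as a table statistic. The class name NdvStats and the rows-as-maps representation are assumptions made for illustration only; they are not Drill or Hive Metastore APIs. The sketch counts exactly and in memory, whereas a real system at scale would more likely use an approximate sketch such as HyperLogLog.

```java
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of Drill: computes exact NDV per column.
final class NdvStats {
    private NdvStats() {}

    /** Counts distinct non-null values per column across the given rows. */
    static Map<String, Long> computeNdv(List<Map<String, Object>> rows) {
        // Column name -> set of values seen so far in that column.
        Map<String, Set<Object>> seen = new LinkedHashMap<>();
        for (Map<String, Object> row : rows) {
            for (Map.Entry<String, Object> cell : row.entrySet()) {
                if (cell.getValue() != null) {
                    seen.computeIfAbsent(cell.getKey(), k -> new HashSet<>())
                        .add(cell.getValue());
                }
            }
        }
        // Reduce each set to its size; this is the statistic to persist.
        Map<String, Long> ndv = new LinkedHashMap<>();
        seen.forEach((column, values) -> ndv.put(column, (long) values.size()));
        return ndv;
    }

    public static void main(String[] args) {
        List<Map<String, Object>> rows = List.of(
            Map.of("city", "Paris", "year", 2017),
            Map.of("city", "Paris", "year", 2018),
            Map.of("city", "Oslo", "year", 2018));
        System.out.println(computeNdv(rows));
    }
}
```

The resulting map (column name to distinct count) is the kind of value a metadata store could keep next to the table schema for the planner to read.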

Re: Discussion about the metadata design

Posted by Paul Rogers <pa...@yahoo.com.INVALID>.

Hi All,

Catching up on this old topic.

One of Drill's main differentiators is the ability to extend Drill with UDFs, custom storage and format plugins, custom security plugins, etc. I wonder if the team has considered taking a modular approach to metadata. Perhaps define a "metadata plugin" API. Then, allow implementations for the Hive Metastore (HMS), for proprietary solutions (AtScale, Alation), or for simple ad-hoc use cases (JSON schema files for specific collections of JSON data, or the existing Drill Parquet metadata files).

Focusing on the API, rather than the implementation, would help Drill grow by allowing Drill to integrate with many different metadata systems.
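
As a rough illustration of what an API-first approach might look like, here is a minimal sketch. Every name in it (MetadataProvider, StaticMetadataProvider, MetadataRegistry, schemaFor) is hypothetical and invented for this example; none of it is an existing Drill interface. The idea is only that the planner talks to one small interface, while HMS-backed, proprietary, or ad-hoc file-based implementations plug in behind it.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical plugin contract; not an actual Drill API.
interface MetadataProvider {
    /** Returns column name -> type name, or empty if the table is unknown here. */
    Optional<Map<String, String>> schemaFor(String tablePath);
}

/** Ad-hoc provider backed by an in-memory map, standing in for e.g. a schema file. */
final class StaticMetadataProvider implements MetadataProvider {
    private final Map<String, Map<String, String>> schemas = new LinkedHashMap<>();

    StaticMetadataProvider register(String tablePath, Map<String, String> schema) {
        schemas.put(tablePath, new LinkedHashMap<>(schema));
        return this;
    }

    @Override
    public Optional<Map<String, String>> schemaFor(String tablePath) {
        return Optional.ofNullable(schemas.get(tablePath));
    }
}

/** Tries each registered provider in order and returns the first schema found. */
final class MetadataRegistry implements MetadataProvider {
    private final List<MetadataProvider> providers = new ArrayList<>();

    MetadataRegistry add(MetadataProvider provider) {
        providers.add(provider);
        return this;
    }

    @Override
    public Optional<Map<String, String>> schemaFor(String tablePath) {
        for (MetadataProvider provider : providers) {
            Optional<Map<String, String>> schema = provider.schemaFor(tablePath);
            if (schema.isPresent()) {
                return schema;
            }
        }
        return Optional.empty();
    }
}

class MetadataPluginDemo {
    public static void main(String[] args) {
        MetadataRegistry registry = new MetadataRegistry()
            .add(new StaticMetadataProvider()
                .register("/data/events.json", Map.of("id", "BIGINT")));
        System.out.println(registry.schemaFor("/data/events.json"));
        System.out.println(registry.schemaFor("/data/unknown.csv"));
    }
}
```

An HMS-backed provider would implement the same interface by calling the metastore's Thrift client; that part is left out here because it depends on decisions the proposal has not made yet.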

Per Weijie's suggestion, I added the above as a comment to DRILL-6552, where I've also included a bit more detail based on the schema issues I've wrestled with in CSV and JSON, and in developing the "result set loader."

Thanks,
- Paul

 

On Thursday, June 28, 2018, 8:30:52 PM PDT, weijie tong <to...@gmail.com> wrote:

Hi Vitalii:

Glad to hear that you are also looking at this part. Let's keep the
discussion under that Jira.

Re: Discussion about the metadata design

Posted by weijie tong <to...@gmail.com>.

Hi Vitalii:

Glad to hear that you are also looking at this part. Let's keep the
discussion under that Jira.

On Fri, Jun 29, 2018 at 1:27 AM Vitalii Diravka <vi...@gmail.com>
wrote:

> Hi Weijie,
>
> Thanks for bringing this topic up!
>
> Basically you are right: Hive Metastore is one of the best candidates for
> storing Drill's metadata.
> It would also be good to create an abstraction that allows other kinds of
> tools to be implemented and used as the Metastore.
> Metastore performance can be an important question, especially for light
> Drill tables.
>
> Currently Vova and I are working on the proposal for the metastore.
> I have created Jira DRILL-6552 [1], where all the related discussions can
> be held.
> [1] https://issues.apache.org/jira/browse/DRILL-6552
>
> Kind regards
> Vitalii
>

Re: Discussion about the metadata design

Posted by Vitalii Diravka <vi...@gmail.com>.

Hi Weijie,

Thanks for bringing this topic up!

Basically you are right: Hive Metastore is one of the best candidates for
storing Drill's metadata.
It would also be good to create an abstraction that allows other kinds of
tools to be implemented and used as the Metastore.
Metastore performance can be an important question, especially for light
Drill tables.

Currently Vova and I are working on the proposal for the metastore.
I have created Jira DRILL-6552 [1], where all the related discussions can
be held.

[1] https://issues.apache.org/jira/browse/DRILL-6552

Kind regards
Vitalii


On Thu, Jun 28, 2018 at 6:49 PM Arina Yelchiyeva <ar...@gmail.com>
wrote:

> Hi,
>
> Vitalii and Vova are also looking at this part; you might want to sync up
> with them. Or even better, we can create a Jira for this and hold all
> discussions there.
> Vitalii, what do you think?
>
> Kind regards,
> Arina

Re: Discussion about the metadata design

Posted by Arina Yelchiyeva <ar...@gmail.com>.

Hi,

Vitalii and Vova are also looking at this part; you might want to sync up
with them. Or even better, we can create a Jira for this and hold all
discussions there.
Vitalii, what do you think?

Kind regards,
Arina
