Posted to dev@drill.apache.org by Aman Sinha <am...@apache.org> on 2017/08/24 15:39:29 UTC

Drill 2.0 (design) hackathon

Drill Developers,

In order to kick-start the Drill 2.0 release discussions, I would like to
propose a Drill 2.0 (design) hackathon (a.k.a. Drill Developer Day ™ :) ).

As I mentioned in the hangout on Tuesday, MapR has offered to host it on
Sept 18th at their offices at 350 Holger Way, San Jose. Hope that works
for most of you!

The goal is to get the community together for a day-long technical
discussion on key topics in preparation for a Drill 2.0 release as well as
potential improvements in upcoming 1.xx releases.  Depending on the
interest areas, we could form groups and have a volunteer lead each group.

Based on prior discussions on the dev list, hangouts and existing JIRAs,
there is already a substantial set of topics and I have summarized a few of
them below. What other topics do folks want to talk about? Feel free to
respond to this thread and I will create a Google doc to consolidate them.
Understandably, the list will be long, but we will use the hackathon to get
a sense of a reasonable feature set for the 1.xx and 2.0 releases.


1. Metadata management.

  1a: Defining an abstraction layer for various types of metadata: views,
schema, statistics, security

  1b: Underlying storage for metadata: what are the options and their
trade-offs?

      - Hive metastore

      - Parquet metadata cache (parquet specific)

      - An embedded DBMS

      - A distributed key-value store

      - Others..



2. Drill integration with Apache Arrow

  2a: Evaluate the choices and tradeoffs



3. Resource management

  3a: Memory limits per query

  3b: Spilling

  3c: Resource management with Drill on Yarn/Mesos/Kubernetes

  3d: Local vs. global resource management

  3e: Aligning with admission control/queueing



4. TPC-DS coverage and related planner/operator enhancements

  4a: Additional set operations: INTERSECT, EXCEPT

  4b: GROUPING SETS, ROLLUP, CUBE support

  4c: Handling inequality joins and cartesian joins of non-scalar inputs
(via Nested Loop Join)

  4d: Remaining gaps in correlated subquery

  4e: Statistics: Number of Distinct Values, Histograms



5. Schema handling

  5a: Creation, management of schema

  5b: Handling schema changes in certain common cases

  5c: Schema-awareness

  5d: Others TBD



6. Concurrency

  6a: What are the bottlenecks to achieving higher concurrency?

  6b: Ideas to address these, e.g. async execution?



7. Storage plugin and REST API related enhancements

    <Topics TBD>



8. Performance improvements

  8a: Filter pushdown

  8b: Vectorized Parquet reader

  8c: Code-gen improvements

  8d: Others TBD

Re: Drill 2.0 (design) hackathon

Posted by Aman Sinha <am...@apache.org>.
Hi Anil,
Yes, a talk about your work on the Kafka+Drill integration would certainly
be very welcome.
Regarding registration, there isn't anything formal yet. We would like to
keep it lightweight, but to keep track of the topics and attendees I will
send out a Google doc or some other way to sign up.
This should happen within the next week or so.

Thanks,
-Aman

On Tue, Aug 29, 2017 at 11:56 AM, AnilKumar B <ak...@gmail.com> wrote:

> Hi Aman,
>
> To attend Drill Developer's Day event, is there any registration process?
>
> Me and Kamesh wanted to present Kafka integration with Drill (
> https://github.com/akumarb2010/incubator-drill/tree/master/contrib/storage-kafka
> & https://issues.apache.org/jira/browse/DRILL-4779 )
>
> Is it possible to provide 15-20 minutes time for this?
>
>
>
> Thanks & Regards,
> B Anil Kumar.
>
> On Thu, Aug 24, 2017 at 8:59 AM, Aman Sinha <am...@apache.org> wrote:
>
> > Hi Charles,
> > yes, it would be great if remote folks could participate. I will look into
> > the options for livestreaming.
> >
> >
> > On Thu, Aug 24, 2017 at 8:42 AM, Charles Givre <cg...@gmail.com> wrote:
> >
> > > Hi Aman,
> > > Would you consider doing some sort of livestream so that those of us who
> > > couldn’t be there in person can participate?
> > > Thanks,
> > > — C
> > >

Re: Drill 2.0 (design) hackathon

Posted by AnilKumar B <ak...@gmail.com>.
Hi Aman,

Is there a registration process for attending the Drill Developer Day event?

Kamesh and I would like to present our Kafka integration with Drill (
https://github.com/akumarb2010/incubator-drill/tree/master/contrib/storage-kafka
& https://issues.apache.org/jira/browse/DRILL-4779 ).

Would it be possible to get 15-20 minutes for this?



Thanks & Regards,
B Anil Kumar.

On Thu, Aug 24, 2017 at 8:59 AM, Aman Sinha <am...@apache.org> wrote:

> Hi Charles,
> yes, it would be great if remote folks could participate. I will look into
> the options for livestreaming.
>
>
> On Thu, Aug 24, 2017 at 8:42 AM, Charles Givre <cg...@gmail.com> wrote:
>
> > Hi Aman,
> > Would you consider doing some sort of livestream so that those of us who
> > couldn’t be there in person can participate?
> > Thanks,
> > — C
> >

Re: Drill 2.0 (design) hackathon

Posted by Aman Sinha <am...@apache.org>.
Hi Charles,
Yes, it would be great if remote folks could participate. I will look into
the options for livestreaming.


On Thu, Aug 24, 2017 at 8:42 AM, Charles Givre <cg...@gmail.com> wrote:

> Hi Aman,
> Would you consider doing some sort of livestream so that those of us who
> couldn’t be there in person can participate?
> Thanks,
> — C
>

Re: Drill 2.0 (design) hackathon

Posted by Charles Givre <cg...@gmail.com>.
Hi Aman, 
Would you consider doing some sort of livestream so that those of us who couldn’t be there in person can participate?
Thanks,
— C


Re: Drill 2.0 (design) hackathon

Posted by Muhammad Gelbana <m....@gmail.com>.
Understood. But if it's possible to stream the event, maybe we can do the
streaming through YouTube too, which can archive the stream afterwards. But
that works only for streams of up to 8 hours.

I'm not a YouTube expert though.

https://support.google.com/youtube/answer/6247592

I'm just afraid I may not be able to attend, and I'm very interested in
what you guys are going to discuss.

On Sep 7, 2017 1:07 AM, "Pritesh Maker" <pm...@mapr.com> wrote:

> Hi
>
> We don't plan on recording the event (it's a day-long event!) but are
> looking at options to have a WebEx or Hangout link if folks want to join
> remotely.
>
> Pritesh
> _____________________________
> From: Muhammad Gelbana <m....@gmail.com>
> Sent: Wednesday, September 6, 2017 1:08 AM
> Subject: Re: Drill 2.0 (design) hackathon
> To: <de...@drill.apache.org>
>
>
> Would anyone kindly own the recording of the event?
>
> On Sep 6, 2017 7:47 AM, "Aman Sinha" <amansinha@apache.org> wrote:
>
> > Here is the Eventbrite event for registration:
> >
> > https://www.eventbrite.com/e/drill-developer-day-sept-2017-registration-7478463285
> >
> > Please register so we can plan for food and drinks appropriately.
> >
> > The link also contains a google doc link for the preliminary agenda and a
> > 'Topics' tab with volunteer sign-up column. Please add your name to the
> > area(s) of interest.
> >
> > Thanks and look forward to seeing you all !
> >
> > -Aman
> >
> > On Wed, Aug 30, 2017 at 9:44 AM, Paul Rogers <progers@mapr.com> wrote:
> >
> > > A partial list of Drill’s public APIs:
> > >
> > > IMHO, highest priority for Drill 2.0.
> > >
> > >
> > > * JDBC/ODBC drivers
> > > * Client (for JDBC/ODBC) + ODBC & JDBC
> > > * Client (for full Drill async, columnar)
> > > * Storage plugin
> > > * Format plugin
> > > * System/session options
> > > * Queueing (e.g. ZK-based queues)
> > > * Rest API
> > > * Resource Planning (e.g. max query memory per node)
> > > * Metadata access, storage (e.g. file system locations vs. a metastore)
> > > * Metadata files formats (Parquet, views, etc.)
> > >
> > > Lower priority for future releases:
> > >
> > >
> > > * Query Planning (e.g. Calcite rules)
> > > * Config options
> > > * SQL syntax, especially Drill extensions
> > > * UDF
> > > * Management (e.g. JMX, Rest API calls, etc.)
> > > * Drill File System (HDFS)
> > > * Web UI
> > > * Shell scripts
> > >
> > > There are certainly more. Please suggest those that are missing. I’ve
> > > taken a rough cut at which APIs need forward/backward compatibility first,
> > > in part based on those that are the “most public” and most likely to
> > > change. Others are important, but we can’t do them all at once.
> > >
> > > Thanks,
> > >
> > > - Paul
> > >
> > > On Aug 29, 2017, at 6:00 PM, Aman Sinha <amansinha@apache.org> wrote:
> > >
> > > Hi Paul,
> > > certainly makes sense to have the API compatibility discussions during this
> > > hackathon. The 2.0 release may be a good checkpoint to introduce breaking
> > > changes necessitating changes to the ODBC/JDBC drivers and other external
> > > applications. As part of this exercise (not during the hackathon but as a
> > > follow-up action), we also should clearly identify the "public" interfaces.
> > >
> > >
> > > I will add this to the agenda.
> > >
> > > thanks,
> > > -Aman
> > >
> > > On Tue, Aug 29, 2017 at 2:08 PM, Paul Rogers <progers@mapr.com> wrote:
> > >
> > > Thanks Aman for organizing the Hackathon!
> > >
> > > The list included many good ideas for Drill 2.0. Some of those require
> > > changes to Drill’s “public” interfaces (file format, client protocol, SQL
> > > behavior, etc.)
> > >
> > > At present, Drill has no good mechanism to handle backward/forward
> > > compatibility at the API level. Protobuf versioning certainly helps, but
> > > can’t completely solve semantic changes (where a field changes meaning, or
> > > a non-Protobuf data chunk changes format.) As just one concrete example,
> > > changing to Arrow will break pre-Arrow ODBC/JDBC drivers because class
> > > names and data formats will change.
> > >
> > > Perhaps we can prioritize, for the proposed 2.0 release, a one-time set of
> > > breaking changes that introduce a versioning mechanism into our public
> > > APIs. Once these are in place, we can evolve the APIs in the future by
> > > following the newly-created versioning protocol.
> > >
> > > Without such a mechanism, we cannot support old & new clients in the same
> > > cluster. Nor can we support rolling upgrades. Of course, another solution
> > > is to get it right the second time, then freeze all APIs and agree to never
> > > again change them. Not sure we have sufficient access to a crystal ball to
> > > predict everything we’d ever need in our APIs, however...
> > >
> > > Thanks,
> > >
> > > - Paul
> > >

Re: Drill 2.0 (design) hackathon

Posted by Pritesh Maker <pm...@mapr.com>.
Hi

We don't plan on recording the event (it's a day-long event!) but are looking at options to have a WebEx or Hangout link if folks want to join remotely.

Pritesh
_____________________________
From: Muhammad Gelbana <m....@gmail.com>
Sent: Wednesday, September 6, 2017 1:08 AM
Subject: Re: Drill 2.0 (design) hackathon
To: <de...@drill.apache.org>


Would anyone kindly own the recording of the event?

On Sep 6, 2017 7:47 AM, "Aman Sinha" <am...@apache.org> wrote:

> Here is the Eventbrite event for registration:
>
> https://www.eventbrite.com/e/drill-developer-day-sept-2017-registration-7478463285
>
> Please register so we can plan for food and drinks appropriately.
>
> The link also contains a google doc link for the preliminary agenda and a
> 'Topics' tab with volunteer sign-up column. Please add your name to the
> area(s) of interest.
>
> Thanks and look forward to seeing you all !
>
> -Aman
>
> On Wed, Aug 30, 2017 at 9:44 AM, Paul Rogers <pr...@mapr.com> wrote:
>
> > A partial list of Drill’s public APIs:
> >
> > IMHO, highest priority for Drill 2.0.
> >
> >
> > * JDBC/ODBC drivers
> > * Client (for JDBC/ODBC) + ODBC & JDBC
> > * Client (for full Drill async, columnar)
> > * Storage plugin
> > * Format plugin
> > * System/session options
> > * Queueing (e.g. ZK-based queues)
> > * Rest API
> > * Resource Planning (e.g. max query memory per node)
> > * Metadata access, storage (e.g. file system locations vs. a metastore)
> > * Metadata files formats (Parquet, views, etc.)
> >
> > Lower priority for future releases:
> >
> >
> > * Query Planning (e.g. Calcite rules)
> > * Config options
> > * SQL syntax, especially Drill extensions
> > * UDF
> > * Management (e.g. JMX, Rest API calls, etc.)
> > * Drill File System (HDFS)
> > * Web UI
> > * Shell scripts
> >
> > There are certainly more. Please suggest those that are missing. I’ve
> > taken a rough cut at which APIs need forward/backward compatibility first,
> > in part based on those that are the “most public” and most likely to
> > change. Others are important, but we can’t do them all at once.
> >
> > Thanks,
> >
> > - Paul
> >
> > On Aug 29, 2017, at 6:00 PM, Aman Sinha <am...@apache.org> wrote:
> >
> > Hi Paul,
> > certainly makes sense to have the API compatibility discussions during this
> > hackathon. The 2.0 release may be a good checkpoint to introduce breaking
> > changes necessitating changes to the ODBC/JDBC drivers and other external
> > applications. As part of this exercise (not during the hackathon but as a
> > follow-up action), we also should clearly identify the "public" interfaces.
> >
> >
> > I will add this to the agenda.
> >
> > thanks,
> > -Aman
> >
> > On Tue, Aug 29, 2017 at 2:08 PM, Paul Rogers <pr...@mapr.com> wrote:
> >
> > Thanks Aman for organizing the Hackathon!
> >
> > The list included many good ideas for Drill 2.0. Some of those require
> > changes to Drill’s “public” interfaces (file format, client protocol, SQL
> > behavior, etc.)
> >
> > At present, Drill has no good mechanism to handle backward/forward
> > compatibility at the API level. Protobuf versioning certainly helps, but
> > can’t completely solve semantic changes (where a field changes meaning, or
> > a non-Protobuf data chunk changes format.) As just one concrete example,
> > changing to Arrow will break pre-Arrow ODBC/JDBC drivers because class
> > names and data formats will change.
> >
> > Perhaps we can prioritize, for the proposed 2.0 release, a one-time set of
> > breaking changes that introduce a versioning mechanism into our public
> > APIs. Once these are in place, we can evolve the APIs in the future by
> > following the newly-created versioning protocol.
> >
> > Without such a mechanism, we cannot support old & new clients in the same
> > cluster. Nor can we support rolling upgrades. Of course, another solution
> > is to get it right the second time, then freeze all APIs and agree to never
> > again change them. Not sure we have sufficient access to a crystal ball to
> > predict everything we’d ever need in our APIs, however...
> >
> > Thanks,
> >
> > - Paul
> >
> > On Aug 24, 2017, at 8:39 AM, Aman Sinha <am...@apache.org> wrote:
> >
> > Drill Developers,
> >
> > In order to kick-start the Drill 2.0 release discussions, I would like
> > to
> > propose a Drill 2.0 (design) hackathon (a.k.a Drill Developer Day ™ J ).
> >
> > As I mentioned in the hangout on Tuesday, MapR has offered to host it on
> > Sept 18th at their offices at 350 Holger Way, San Jose. Hope that works
> > for most of you!
> >
> > The goal is to get the community together for a day-long technical
> > discussion on key topics in preparation for a Drill 2.0 release as well
> > as
> > potential improvements in upcoming 1.xx releases. Depending on the
> > interest areas, we could form groups and have a volunteer lead each
> > group.
> >
> > Based on prior discussions on the dev list, hangouts and existing JIRAs,
> > there is already a substantial set of topics and I have summarized a few
> > of
> > them below. What other topics do folks want to talk about? Feel free
> > to
> > respond to this thread and I will create a google doc to consolidate.
> > Understandably, the list would be long but we will use the hackathon to
> > get
> > a sense of a reasonable feature set for 1.xx and 2.0 releases.
> >
> >
> > 1. Metadata management.
> >
> > 1a: Defining an abstraction layer for various types of metadata: views,
> > schema, statistics, security
> >
> > 1b: Underlying storage for metadata: what are the options and their
> > trade-offs?
> >
> > - Hive metastore
> >
> > - Parquet metadata cache (parquet specific)
> >
> > - An embedded DBMS
> >
> > - A distributed key-value store
> >
> > - Others..
> >
> >
> >
> > 2. Drill integration with Apache Arrow
> >
> > 2a: Evaluate the choices and tradeoffs
> >
> >
> >
> > 3. Resource management
> >
> > 3a: Memory limits per query
> >
> > 3b: Spilling
> >
> > 3c: Resource management with Drill on Yarn/Mesos/Kubernetes
> >
> > 3d: Local vs. global resource management
> >
> > 3e: Aligning with admission control/queueing
> >
> >
> >
> > 4. TPC-DS coverage and related planner/operator enhancements
> >
> > 4a: Additional set operations: INTERSECT, EXCEPT
> >
> > 4b: GROUPING SETS, ROLLUP, CUBE support
> >
> > 4c: Handling inequality joins and cartesian joins of non-scalar inputs
> > (via Nested Loop Join)
> >
> > 4d: Remaining gaps in correlated subquery
> >
> > 4e: Statistics: Number of Distinct Values, Histograms
> >
> >
> >
> > 5. Schema handling
> >
> > 5a: Creation, management of schema
> >
> > 5b: Handling schema changes in certain common cases
> >
> > 5c: Schema-awareness
> >
> > 5d: Others TBD
> >
> >
> >
> > 6. Concurrency
> >
> > 6a: What are the bottlenecks to achieving higher concurrency
> >
> > 6b: Ideas to address these..e.g async execution ?
> >
> >
> >
> > 7. Storage plugins, REST APIs related enhancements
> >
> > <Topics TBD>
> >
> >
> >
> > 8. Performance improvements
> >
> > 8a: Filter pushdown
> >
> > 8b: Vectorized Parquet reader
> >
> > 8c: Code-gen improvements
> >
> > 8d: Others TBD
> >
> >
> >
> >
>



Re: Drill 2.0 (design) hackathon

Posted by Muhammad Gelbana <m....@gmail.com>.
Would anyone kindly take ownership of recording the event?

On Sep 6, 2017 7:47 AM, "Aman Sinha" <am...@apache.org> wrote:

> Here is the Eventbrite event for registration:
>
> https://www.eventbrite.com/e/drill-developer-day-sept-2017-registration-7478463285
>
> Please register so we can plan for food and drinks appropriately.
>
> The link also contains a google doc link for the preliminary agenda and a
> 'Topics' tab with volunteer sign-up column.  Please add your name to the
> area(s) of interest.
>
> Thanks, and I look forward to seeing you all!
>
> -Aman
>
> On Wed, Aug 30, 2017 at 9:44 AM, Paul Rogers <pr...@mapr.com> wrote:
>
> > A partial list of Drill’s public APIs:
> >
> > IMHO, highest priority for Drill 2.0.
> >
> >
> >   *   JDBC/ODBC drivers
> >   *   Client (for JDBC/ODBC) + ODBC & JDBC
> >   *   Client (for full Drill async, columnar)
> >   *   Storage plugin
> >   *   Format plugin
> >   *   System/session options
> >   *   Queueing (e.g. ZK-based queues)
> >   *   Rest API
> >   *   Resource Planning (e.g. max query memory per node)
> >   *   Metadata access, storage (e.g. file system locations vs. a metastore)
> >   *   Metadata files formats (Parquet, views, etc.)
> >
> > Lower priority for future releases:
> >
> >
> >   *   Query Planning (e.g. Calcite rules)
> >   *   Config options
> >   *   SQL syntax, especially Drill extensions
> >   *   UDF
> >   *   Management (e.g. JMX, Rest API calls, etc.)
> >   *   Drill File System (HDFS)
> >   *   Web UI
> >   *   Shell scripts
> >
> > There are certainly more. Please suggest those that are missing. I’ve
> > taken a rough cut at which APIs need forward/backward compatibility first,
> > in part based on those that are the “most public” and most likely to
> > change. Others are important, but we can’t do them all at once.
> >
> > Thanks,
> >
> > - Paul
> >
> > On Aug 29, 2017, at 6:00 PM, Aman Sinha <amansinha@apache.org> wrote:
> >
> > Hi Paul,
> > certainly makes sense to have the API compatibility discussions during this
> > hackathon.  The 2.0 release may be a good checkpoint to introduce breaking
> > changes necessitating changes to the ODBC/JDBC drivers and other external
> > applications. As part of this exercise (not during the hackathon but as a
> > follow-up action), we also should clearly identify the "public" interfaces.
> >
> >
> > I will add this to the agenda.
> >
> > thanks,
> > -Aman
> >

Re: Drill 2.0 (design) hackathon

Posted by AnilKumar B <ak...@gmail.com>.
Thanks All, it is really helpful.

On Wed, Sep 20, 2017 at 8:13 AM Charles Givre <cg...@gmail.com> wrote:

> Thank you Aman for organizing and to MapR for hosting!
>
> On Wed, Sep 20, 2017 at 11:12 AM, Aman Sinha <am...@apache.org> wrote:
>
> > Thanks to all the folks who attended the hackathon - both local and remote.
> > For the remote attendees, you missed out on a good dinner :)
> >
> > We had a day of excellent discussion on several topics:  Resource
> > management, operator level performance improvements, TPC-DS coverage,
> > metadata management, concurrency, usability and error handling, storage
> > plugins + rest APIs.   It will take a couple of days to compile all the
> > notes and we will post them.
> >
> > Since the focus was more in-depth discussion rather than breadth, and 1 day
> > is clearly not adequate, some topics were left out.  We can continue those
> > discussions on the dev list / hangout  or if it can wait, possibly do it in
> > a future hackathon.
> >
> > -Aman
> >
> > On Fri, Sep 15, 2017 at 2:54 PM, Charles Givre <cg...@gmail.com> wrote:
> >
> > > Hi Pritesh,
> > > What time do you think you’d want me to present?  Also, should I make some
> > > slides?
> > > Best,
> > > — C
> > >
> > > > On Sep 15, 2017, at 13:23, Pritesh Maker <pm...@mapr.com> wrote:
> > > >
> > > > Hi All
> > > >
> > > > We are looking forward to hosting the hackathon on Monday. Just a few
> > > > updates on the logistics and agenda
> > > >
> > > > • We are expecting over 25 people attending the event – you can see the
> > > > attendee list at the Eventbrite site -
> > > > https://www.eventbrite.com/e/drill-developer-day-sept-2017-registration-7478463285
> > > >
> > > > • Breakfast will be served starting at 8:30AM – we would like to begin
> > > > promptly at 9AM
> > > >
> > > > • The agenda has been updated to reflect the speakers (see the update in
> > > > the sheet -
> > > > https://docs.google.com/spreadsheets/d/1PEpgmBNAaPcu9UhWmZ8yPYtXbUGqOAYwH87alWkpCic/edit#gid=0 )
> > > > o Key Note & Introduction – Ted Dunning, Parth Chandra and Aman Sinha
> > > > o Community Contributions – Anil Kumar, John Omernik, Charles Givre and
> > > > Ted Dunning
> > > > o Two tracks for technical design discussions – some topics have initial
> > > > thoughts prepared and some will have open brainstorming discussions
> > > > o Once the discussions are concluded, we will have summaries presented
> > > > and notes shared with the community
> > > >
> > > > • We will have a WebEx for the first two sessions. For the two tracks,
> > > > we will either continue the WebEx or have Hangout links (will publish them
> > > > to the google sheet)
> > > > "JOIN WEBEX MEETING
> > > > https://mapr.webex.com/mapr/j.php?MTID=m9d39036e3953cce59ea81250c70c6c76
> > > > Meeting number (access code): 806 111 950
> > > > Meeting password: ApacheDrill"
> > > >
> > > > • For the attendees in person, we have made bookings for a dinner in the
> > > > evening - https://www.yelp.com/biz/chili-garden-restaurant-milpitas
> > > >
> > > > Looking forward to a fantastic day for the Apache Drill community!
> > > >
> > > > Thanks,
> > > > Pritesh
> > > >
> > > >
> > > >
-- 
Thanks & Regards,
B Anil Kumar.

Re: Drill 2.0 (design) hackathon

Posted by Charles Givre <cg...@gmail.com>.
Thank you Aman for organizing and to MapR for hosting!

On Wed, Sep 20, 2017 at 11:12 AM, Aman Sinha <am...@apache.org> wrote:

> Thanks to all the folks who attended the hackathon - both local and remote.
>   For the remote attendees, you missed out on a good dinner :)
>
> We had a day of excellent discussion on several topics:  Resource
> management, operator level performance improvements, TPC-DS coverage,
> metadata management, concurrency, usability and error handling, storage
> plugins + rest APIs.   It will take a couple of days to compile all the
> notes and we will post them.
>
> Since the focus was more in-depth discussion rather than breadth, and 1 day
> is clearly not adequate, some topics were left out.  We can continue those
> discussions on the dev list / hangout  or if it can wait, possibly do it in
> a future hackathon.
>
> -Aman
>

RE: Drill 2.0 (design) hackathon

Posted by Kunal Khatua <kk...@mapr.com>.
I think that's a good idea. 
We could put this up in a list (in the google doc) of items to discuss on the hangout. That way, if we have no pressing topics to discuss, we can pick something from the list.

-----Original Message-----
From: Aman Sinha [mailto:amansinha@apache.org] 
Sent: Wednesday, September 20, 2017 8:13 AM
To: dev@drill.apache.org
Subject: Re: Drill 2.0 (design) hackathon

Thanks to all the folks who attended the hackathon - both local and remote.
  For the remote attendees, you missed out on a good dinner :)

We had a day of excellent discussion on several topics: resource management, operator-level performance improvements, TPC-DS coverage, metadata management, concurrency, usability and error handling, and storage plugins + REST APIs. It will take a couple of days to compile all the notes, and we will post them.

Since the focus was on in-depth discussion rather than breadth, and one day is clearly not adequate, some topics were left out. We can continue those discussions on the dev list or hangout or, if they can wait, possibly take them up at a future hackathon.

-Aman

Re: Drill 2.0 (design) hackathon

Posted by Aman Sinha <am...@apache.org>.
Thanks to all the folks who attended the hackathon - both local and remote.
  For the remote attendees, you missed out on a good dinner :)

We had a day of excellent discussion on several topics: resource
management, operator-level performance improvements, TPC-DS coverage,
metadata management, concurrency, usability and error handling, and storage
plugins + REST APIs. It will take a couple of days to compile all the
notes, and we will post them.

Since the focus was on in-depth discussion rather than breadth, and one day
is clearly not adequate, some topics were left out. We can continue those
discussions on the dev list or hangout or, if they can wait, possibly take
them up at a future hackathon.

-Aman

On Fri, Sep 15, 2017 at 2:54 PM, Charles Givre <cg...@gmail.com> wrote:

> Hi Pritesh,
> What time do you think you’d want me to present?  Also, should I make some
> slides?
> Best,
> — C

Re: Drill 2.0 (design) hackathon

Posted by Charles Givre <cg...@gmail.com>.
Hi Pritesh, 
What time do you think you’d want me to present?  Also, should I make some slides?  
Best,
— C

> On Sep 15, 2017, at 13:23, Pritesh Maker <pm...@mapr.com> wrote:
> 
> Hi All
> 
> We are looking forward to hosting the hackathon on Monday. Just a few updates on the logistics and agenda
> 
> • We are expecting over 25 people attending the event – you can see the attendee list at the Eventbrite site -  https://www.eventbrite.com/e/drill-developer-day-sept-2017-registration-7478463285 
> 
> • Breakfast will be served starting at 8:30AM – we would like to begin promptly at 9AM 
> 
> • The agenda has been updated to reflect the speakers (see the update in the sheet - https://docs.google.com/spreadsheets/d/1PEpgmBNAaPcu9UhWmZ8yPYtXbUGqOAYwH87alWkpCic/edit#gid=0 )
> o Key Note & Introduction – Ted Dunning, Parth Chandra and Aman Sinha 
> o Community Contributions – Anil Kumar, John Omernik, Charles Givre and Ted Dunning 
> o Two tracks for technical design discussions – some topics have initial thoughts for the topics and some will have open brainstorming discussions
> o Once the discussions are concluded, we will have summaries presented and notes shared with the community
> 
> • We will have a WebEx for the first two sessions. For the two tracks, we will either continue the WebEx or have Hangout links (will publish them to the google sheet)
> "JOIN WEBEX MEETING
> https://mapr.webex.com/mapr/j.php?MTID=m9d39036e3953cce59ea81250c70c6c76
> Meeting number (access code): 806 111 950
> Meeting password: ApacheDrill"
> 
> • For the attendees in person, we have made bookings for a dinner in the evening - https://www.yelp.com/biz/chili-garden-restaurant-milpitas 
> 
> Looking forward to a fantastic day for the Apache Drill! community!
> 
> Thanks,
> Pritesh
> 
> 
> 
> On 9/5/17, 10:47 PM, "Aman Sinha" <am...@apache.org> wrote:
> 
>    Here is the Eventbrite event for registration:
> 
>    https://www.eventbrite.com/e/drill-developer-day-sept-2017-registration-7478463285
> 
>    Please register so we can plan for food and drinks appropriately.
> 
>    The link also contains a google doc link for the preliminary agenda and a
>    'Topics' tab with volunteer sign-up column.  Please add your name to the
>    area(s) of interest.
> 
>    Thanks and look forward to seeing you all !
> 
>    -Aman
> 
>    On Wed, Aug 30, 2017 at 9:44 AM, Paul Rogers <pr...@mapr.com> wrote:
> 
>> A partial list of Drill’s public APIs:
>> 
>> IMHO, highest priority for Drill 2.0.
>> 
>> 
>>  *   JDBC/ODBC drivers
>>  *   Client (for JDBC/ODBC) + ODBC & JDBC
>>  *   Client (for full Drill async, columnar)
>>  *   Storage plugin
>>  *   Format plugin
>>  *   System/session options
>>  *   Queueing (e.g. ZK-based queues)
>>  *   Rest API
>>  *   Resource Planning (e.g. max query memory per node)
>>  *   Metadata access, storage (e.g. file system locations vs. a metastore)
>>  *   Metadata files formats (Parquet, views, etc.)
>> 
>> Lower priority for future releases:
>> 
>> 
>>  *   Query Planning (e.g. Calcite rules)
>>  *   Config options
>>  *   SQL syntax, especially Drill extensions
>>  *   UDF
>>  *   Management (e.g. JMX, Rest API calls, etc.)
>>  *   Drill File System (HDFS)
>>  *   Web UI
>>  *   Shell scripts
>> 
>> There are certainly more. Please suggest those that are missing. I’ve
>> taken a rough cut at which APIs need forward/backward compatibility first,
>> in part based on those that are the “most public” and most likely to
>> change. Others are important, but we can’t do them all at once.
>> 
>> Thanks,
>> 
>> - Paul
>> 
>> On Aug 29, 2017, at 6:00 PM, Aman Sinha <amansinha@apache.org> wrote:
>> 
>> Hi Paul,
>> certainly makes sense to have the API compatibility discussions during this
>> hackathon.  The 2.0 release may be a good checkpoint to introduce breaking
>> changes necessitating changes to the ODBC/JDBC drivers and other external
>> applications. As part of this exercise (not during the hackathon but as a
>> follow-up action), we also should clearly identify the "public" interfaces.
>> 
>> 
>> I will add this to the agenda.
>> 
>> thanks,
>> -Aman
>> 
>> On Tue, Aug 29, 2017 at 2:08 PM, Paul Rogers <progers@mapr.com> wrote:
>> 
>> Thanks Aman for organizing the Hackathon!
>> 
>> The list included many good ideas for Drill 2.0. Some of those require
>> changes to Drill’s “public” interfaces (file format, client protocol, SQL
>> behavior, etc.)
>> 
>> At present, Drill has no good mechanism to handle backward/forward
>> compatibility at the API level. Protobuf versioning certainly helps, but
>> can’t completely solve semantic changes (where a field changes meaning, or
>> a non-Protobuf data chunk changes format.) As just one concrete example,
>> changing to Arrow will break pre-Arrow ODBC/JDBC drivers because class
>> names and data formats will change.
>> 
>> Perhaps we can prioritize, for the proposed 2.0 release, a one-time set of
>> breaking changes that introduce a versioning mechanism into our public
>> APIs. Once these are in place, we can evolve the APIs in the future by
>> following the newly-created versioning protocol.
>> 
>> Without such a mechanism, we cannot support old & new clients in the same
>> cluster. Nor can we support rolling upgrades. Of course, another solution
>> is to get it right the second time, then freeze all APIs and agree to never
>> again change them. Not sure we have sufficient access to a crystal ball to
>> predict everything we’d ever need in our APIs, however...
>> 
>> Thanks,
>> 
>> - Paul
>> 
>> On Aug 24, 2017, at 8:39 AM, Aman Sinha <amansinha@apache.org> wrote:
>> 
>> Drill Developers,
>> 
>> In order to kick-start the Drill 2.0 release discussions, I would like to
>> propose a Drill 2.0 (design) hackathon (a.k.a. Drill Developer Day ™ :) ).
>> 
>> As I mentioned in the hangout on Tuesday, MapR has offered to host it on
>> Sept 18th at their offices at 350 Holger Way, San Jose.  Hope that works
>> for most of you!
>> 
>> The goal is to get the community together for a day-long technical
>> discussion on key topics in preparation for a Drill 2.0 release as well as
>> potential improvements in upcoming 1.xx releases.  Depending on the
>> interest areas, we could form groups and have a volunteer lead each group.
>> 
>> Based on prior discussions on the dev list, hangouts and existing JIRAs,
>> there is already a substantial set of topics and I have summarized a few of
>> them below.  What other topics do folks want to talk about?  Feel free to
>> respond to this thread and I will create a google doc to consolidate.
>> Understandably, the list would be long but we will use the hackathon to get
>> a sense of a reasonable feature set for 1.xx and 2.0 releases.
>> 
>> 
>> 1. Metadata management.
>> 
>> 1a: Defining an abstraction layer for various types of metadata: views,
>> schema, statistics, security
>> 
>> 1b: Underlying storage for metadata: what are the options and their
>> trade-offs?
>> 
>>    - Hive metastore
>> 
>>    - Parquet metadata cache (parquet specific)
>> 
>>    - An embedded DBMS
>> 
>>    - A distributed key-value store
>> 
>>    - Others..
>> 
>> 
>> 
>> 2. Drill integration with Apache Arrow
>> 
>> 2a: Evaluate the choices and tradeoffs
>> 
>> 
>> 
>> 3. Resource management
>> 
>> 3a: Memory limits per query
>> 
>> 3b: Spilling
>> 
>> 3c: Resource management with Drill on Yarn/Mesos/Kubernetes
>> 
>> 3d: Local vs. global resource management
>> 
>> 3e: Aligning with admission control/queueing
>> 
>> 
>> 
>> 4. TPC-DS coverage and related planner/operator enhancements
>> 
>> 4a: Additional set operations: INTERSECT, EXCEPT
>> 
>> 4b: GROUPING SETS, ROLLUP, CUBE support
>> 
>> 4c: Handling inequality joins and cartesian joins of non-scalar inputs
>> (via Nested Loop Join)
>> 
>> 4d: Remaining gaps in correlated subquery
>> 
>> 4e: Statistics: Number of Distinct Values, Histograms
>> 
>> 
>> 
>> 5. Schema handling
>> 
>> 5a: Creation, management of schema
>> 
>> 5b: Handling schema changes in certain common cases
>> 
>> 5c: Schema-awareness
>> 
>> 5d: Others TBD
>> 
>> 
>> 
>> 6. Concurrency
>> 
>> 6a: What are the bottlenecks to achieving higher concurrency?
>> 
>> 6b: Ideas to address these, e.g. async execution?
>> 
>> 
>> 
>> 7. Storage plugins,  REST APIs related enhancements
>> 
>>  <Topics TBD>
>> 
>> 
>> 
>> 8. Performance improvements
>> 
>> 8a: Filter pushdown
>> 
>> 8b: Vectorized Parquet reader
>> 
>> 8c: Code-gen improvements
>> 
>> 8d: Others TBD
>> 
>> 
>> 
>> 
> 
> 


Re: Drill 2.0 (design) hackathon

Posted by Pritesh Maker <pm...@mapr.com>.
Hi All

We are looking forward to hosting the hackathon on Monday. Here are a few updates on the logistics and agenda:

• We are expecting over 25 people to attend the event – you can see the attendee list at the Eventbrite site - https://www.eventbrite.com/e/drill-developer-day-sept-2017-registration-7478463285

• Breakfast will be served starting at 8:30AM – we would like to begin promptly at 9AM 

• The agenda has been updated to reflect the speakers (see the update in the sheet - https://docs.google.com/spreadsheets/d/1PEpgmBNAaPcu9UhWmZ8yPYtXbUGqOAYwH87alWkpCic/edit#gid=0 )
o Keynote & Introduction – Ted Dunning, Parth Chandra and Aman Sinha
o Community Contributions – Anil Kumar, John Omernik, Charles Givre and Ted Dunning
o Two tracks for technical design discussions – some topics already have initial thoughts prepared, and others will be open brainstorming discussions
o Once the discussions are concluded, we will have summaries presented and notes shared with the community

• We will have a WebEx for the first two sessions. For the two tracks, we will either continue the WebEx or have Hangout links (will publish them to the google sheet)
"JOIN WEBEX MEETING
https://mapr.webex.com/mapr/j.php?MTID=m9d39036e3953cce59ea81250c70c6c76
Meeting number (access code): 806 111 950
Meeting password: ApacheDrill"

• For the attendees in person, we have made bookings for a dinner in the evening - https://www.yelp.com/biz/chili-garden-restaurant-milpitas 

Looking forward to a fantastic day for the Apache Drill community!

Thanks,
Pritesh





Re: Drill 2.0 (design) hackathon

Posted by Aman Sinha <am...@apache.org>.
Here is the Eventbrite event for registration:

https://www.eventbrite.com/e/drill-developer-day-sept-2017-registration-7478463285

Please register so we can plan for food and drinks appropriately.

The event page also links to a Google doc with the preliminary agenda and a
'Topics' tab with a volunteer sign-up column.  Please add your name to the
area(s) of interest.

Thanks, and I look forward to seeing you all!

-Aman


Re: Drill 2.0 (design) hackathon

Posted by Paul Rogers <pr...@mapr.com>.
A partial list of Drill’s public APIs:

IMHO, highest priority for Drill 2.0.


  *   JDBC/ODBC drivers
  *   Client (for JDBC/ODBC) + ODBC & JDBC
  *   Client (for full Drill async, columnar)
  *   Storage plugin
  *   Format plugin
  *   System/session options
  *   Queueing (e.g. ZK-based queues)
  *   REST API
  *   Resource Planning (e.g. max query memory per node)
  *   Metadata access, storage (e.g. file system locations vs. a metastore)
  *   Metadata file formats (Parquet, views, etc.)

Lower priority for future releases:


  *   Query Planning (e.g. Calcite rules)
  *   Config options
  *   SQL syntax, especially Drill extensions
  *   UDF
  *   Management (e.g. JMX, REST API calls, etc.)
  *   Drill File System (HDFS)
  *   Web UI
  *   Shell scripts

There are certainly more. Please suggest those that are missing. I’ve taken a rough cut at which APIs need forward/backward compatibility first, in part based on those that are the “most public” and most likely to change. Others are important, but we can’t do them all at once.

Thanks,

- Paul

Re: Drill 2.0 (design) hackathon

Posted by Aman Sinha <am...@apache.org>.
Hi Paul,
It certainly makes sense to have the API compatibility discussions during this
hackathon.  The 2.0 release may be a good checkpoint to introduce breaking
changes necessitating changes to the ODBC/JDBC drivers and other external
applications. As part of this exercise (not during the hackathon but as a
follow-up action), we also should clearly identify the "public" interfaces.


I will add this to the agenda.

thanks,
-Aman


Re: Drill 2.0 (design) hackathon

Posted by Paul Rogers <pr...@mapr.com>.
Thanks Aman for organizing the Hackathon!

The list included many good ideas for Drill 2.0. Some of those require changes to Drill’s “public” interfaces (file format, client protocol, SQL behavior, etc.).

At present, Drill has no good mechanism to handle backward/forward compatibility at the API level. Protobuf versioning certainly helps, but can’t completely solve semantic changes (where a field changes meaning, or a non-Protobuf data chunk changes format). As just one concrete example, changing to Arrow will break pre-Arrow ODBC/JDBC drivers because class names and data formats will change.

Perhaps we can prioritize, for the proposed 2.0 release, a one-time set of breaking changes that introduce a versioning mechanism into our public APIs. Once these are in place, we can evolve the APIs in the future by following the newly created versioning protocol.

Without such a mechanism, we cannot support old & new clients in the same cluster. Nor can we support rolling upgrades. Of course, another solution is to get it right the second time, then freeze all APIs and agree to never again change them. Not sure we have sufficient access to a crystal ball to predict everything we’d ever need in our APIs, however...
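
To make the idea of a versioning mechanism concrete, here is a minimal sketch of one possible approach: each side advertises the inclusive range of API versions it can speak, and the connection uses the highest version both support. This is only an illustration under stated assumptions. The class and method names (ApiVersionNegotiation, VersionRange, negotiate) are hypothetical and are not part of Drill; nothing in this thread says Drill would do it this way.

```java
import java.util.OptionalInt;

// Hypothetical sketch of API version negotiation; none of these names exist in Drill.
public class ApiVersionNegotiation {

    /** Inclusive range of API versions that one side (client or server) can speak. */
    public static final class VersionRange {
        final int min;
        final int max;

        public VersionRange(int min, int max) {
            if (min > max) {
                throw new IllegalArgumentException("min must not exceed max");
            }
            this.min = min;
            this.max = max;
        }
    }

    /**
     * Returns the highest version supported by both sides,
     * or an empty result if the two ranges do not overlap.
     */
    public static OptionalInt negotiate(VersionRange client, VersionRange server) {
        int highestCommon = Math.min(client.max, server.max);
        int lowestCommon = Math.max(client.min, server.min);
        return highestCommon >= lowestCommon
                ? OptionalInt.of(highestCommon)
                : OptionalInt.empty();
    }

    public static void main(String[] args) {
        // Assumed scenario: a server mid-way through a rolling upgrade speaks versions 1 and 2.
        VersionRange server = new VersionRange(1, 2);
        VersionRange oldClient = new VersionRange(1, 1);
        VersionRange newClient = new VersionRange(2, 3);
        VersionRange tooNewClient = new VersionRange(3, 4);

        System.out.println("old client:     " + negotiate(oldClient, server));
        System.out.println("new client:     " + negotiate(newClient, server));
        System.out.println("too-new client: " + negotiate(tooNewClient, server));
    }
}
```

With a scheme along these lines, old and new clients could share a cluster as long as each server keeps supporting at least one version the client also speaks, and a client whose range does not overlap gets a clean rejection instead of a silent semantic mismatch.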

Thanks,

- Paul
