Posted to dev@drill.apache.org by Arina Ielchiieva <ar...@apache.org> on 2018/08/13 12:41:53 UTC

[DISCUSSION] current project state

Hi all,

As the new PMC Chair, I would like to thank users for choosing and using
Apache Drill and contributors / committers for making improvements and
fixes. Recently Apache Drill 1.14 was released bundled up with many
improvements and new features. Please feel free to try it out and share
your experience. As always we would love to hear your success stories of
using Apache Drill.

I also encourage users to share any problems found in Drill, as well as any
suggestions for future improvements. Feel free to start a discussion on the
mailing list and then file a Jira with the summary. Contributions are
always welcome: minor, major, doc improvements or grammar fixes. Just file
a Jira and open a PR. Do not hesitate to ping developers on the mailing
list if a PR is not being reviewed in a timely manner.

Latest project reports show:
The Apache Drill project has a healthy release schedule, and each release
includes lots of features.
The mailing lists (user / dev) are getting substantial support from the active
developers, as are Stack Overflow and Twitter.
New committers are added on a steady basis.

Overall, the project is growing and moving forward. There were discussions
about Drill 2.0 last year, and currently the Drill metastore feature is under
active investigation, which might be the breaking change for 2.0.

Please feel free to reply to this email with your comments / concerns /
ideas about current project state.

Kind regards,
Arina

Re: [DISCUSSION] current project state

Posted by Paul Rogers <pa...@yahoo.com.INVALID>.

Hi Derich,

From the shameless self promotion dept., Charles and I are wrapping up the O’Reilly book “Learning Apache Drill” that gives an in-depth discussion of format plugins and UDFs.

We still have a need for docs on storage plugins.

- Paul


> On Aug 27, 2018, at 9:04 AM, Carlos Derich <ca...@gmail.com> wrote:
> 
> Hello guys,
> 
> Thanks for bringing up this discussion. I may be a little bit late, but I
> would like to add a use case I've been through recently.
> 
> I think Drill should be able to use ZK for storing session data. In a
> multiple Drillbit scenario, if a second Drillbit receives a request with a
> session token attached, it should try to retrieve that session from ZK
> before generating a new session. I have seen this topic pop up every now
> and then, and I'd be happy to work on it as a first contribution if we decide
> how this should work.
> 
> I would also like to add that we could try to focus on improving the
> documentation, more specifically on docs that currently do not exist and
> also on writing custom storage plugins. I had to work recently on a custom
> storage plugin for GeoJSON files and I think the only resource I could find
> on this was Charles's log-file plugin (
> https://github.com/cgivre/drill-logfile-plugin), which I am incredibly
> grateful for. I would be more than happy to work on these docs.
> 
> Derich.
> 
>> On Sun, Aug 19, 2018 at 4:49 AM Uwe L. Korn <uw...@xhochy.com> wrote:
>> 
>> Hello Arina
>> 
>>> On Tue, Aug 14, 2018, at 4:08 PM, Arina Yelchiyeva wrote:
>>> 3. Drill vs Arrow is a topic I have heard about since I started working with
>>> Drill. But so far nobody has dared to tackle it. I suspect Drill would first
>>> have to contribute changes to Arrow to be able to migrate, which could
>>> be a show-stopper if the Arrow community does not accept them.
>> 
>> What would the changes be that need to go into Arrow to make it usable for
>> Drill? I suspect that many of them should also align with the Arrow
>> project, especially as the Java code of Arrow started out from Drill's
>> ValueVector code. If you already know some of the issues, it would be
>> really helpful to open tickets in the ARROW JIRA (feel free to add a drill
>> label to them, so one can search for them). Even if there is no plan to
>> implement them currently, it definitely helps give us Arrow developers
>> visibility into what users of the Arrow library need and what prevents them
>> from adopting it.
>> 
>> Uwe
>>

Re: [DISCUSSION] current project state

Posted by Carlos Derich <ca...@gmail.com>.

Hello guys,

Thanks for bringing up this discussion. I may be a little bit late, but I
would like to add a use case I've been through recently.

I think Drill should be able to use ZK for storing session data. In a
multiple Drillbit scenario, if a second Drillbit receives a request with a
session token attached, it should try to retrieve that session from ZK
before generating a new session. I have seen this topic pop up every now
and then, I'd be happy to work on it as a first contribution if we decide
how this should work.
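
As a rough illustration of the lookup order described above, here is a minimal sketch. It assumes a plain in-memory dict as a stand-in for ZooKeeper; `SessionStore`, `Drillbit`, `handle_request` and the option name are hypothetical names made up for this example and are not Drill APIs.

```python
import uuid


class SessionStore:
    """Stand-in for a shared store such as ZooKeeper (here just a dict)."""

    def __init__(self):
        self._sessions = {}

    def get(self, token):
        return self._sessions.get(token)

    def put(self, token, session):
        self._sessions[token] = session


class Drillbit:
    """Hypothetical node that resolves sessions through the shared store."""

    def __init__(self, name, store):
        self.name = name
        self.store = store

    def handle_request(self, token=None):
        # First try to find the session in the shared store.
        if token is not None:
            session = self.store.get(token)
            if session is not None:
                return token, session
        # Missing or unknown token: create a new session and publish it.
        token = uuid.uuid4().hex
        session = {"created_by": self.name, "options": {}}
        self.store.put(token, session)
        return token, session


store = SessionStore()
bit_a = Drillbit("drillbit-a", store)
bit_b = Drillbit("drillbit-b", store)

# The first request lands on drillbit-a, which creates the session.
token, session = bit_a.handle_request()
session["options"]["example.option"] = "value"  # hypothetical option name
store.put(token, session)

# A later request with the same token lands on drillbit-b.
same_token, same_session = bit_b.handle_request(token)
print(same_token == token, same_session["created_by"])
```

A real implementation would also have to decide on session expiry and on what happens when two Drillbits update the same session concurrently, which this sketch leaves out.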

I would also like to add that we could try to focus on improving the
documentation, more specifically on docs that currently do not exist and
also on writing custom storage plugins. I had to work recently on a custom
storage plugin for GeoJSON files and I think the only resource I could find
on this was Charles's log-file plugin (
https://github.com/cgivre/drill-logfile-plugin), which I am incredibly
grateful for. I would be more than happy to work on these docs.

Derich.

On Sun, Aug 19, 2018 at 4:49 AM Uwe L. Korn <uw...@xhochy.com> wrote:

> Hello Arina
>
> On Tue, Aug 14, 2018, at 4:08 PM, Arina Yelchiyeva wrote:
> > 3. Drill vs Arrow is a topic I have heard about since I started working with
> > Drill. But so far nobody has dared to tackle it. I suspect Drill would first
> > have to contribute changes to Arrow to be able to migrate, which could
> > be a show-stopper if the Arrow community does not accept them.
>
> What would the changes be that need to go into Arrow to make it usable for
> Drill? I suspect that many of them should also align with the Arrow
> project, especially as the Java code of Arrow started out from Drill's
> ValueVector code. If you already know some of the issues, it would be
> really helpful to open tickets in the ARROW JIRA (feel free to add a drill
> label to them, so one can search for them). Even if there is no plan to
> implement them currently, it definitely helps give us Arrow developers
> visibility into what users of the Arrow library need and what prevents them
> from adopting it.
>
> Uwe
>

Re: [DISCUSSION] current project state

Posted by "Uwe L. Korn" <uw...@xhochy.com>.

Hello Arina

On Tue, Aug 14, 2018, at 4:08 PM, Arina Yelchiyeva wrote:
> 3. Drill vs Arrow is a topic I have heard about since I started working with
> Drill. But so far nobody has dared to tackle it. I suspect Drill would first
> have to contribute changes to Arrow to be able to migrate, which could
> be a show-stopper if the Arrow community does not accept them.

What would the changes be that need to go into Arrow to make it usable for Drill? I suspect that many of them should also align with the Arrow project, especially as the Java code of Arrow started out from Drill's ValueVector code. If you already know some of the issues, it would be really helpful to open tickets in the ARROW JIRA (feel free to add a drill label to them, so one can search for them). Even if there is no plan to implement them currently, it definitely helps give us Arrow developers visibility into what users of the Arrow library need and what prevents them from adopting it.

Uwe

Re: [DISCUSSION] current project state

Posted by Paul Rogers <pa...@yahoo.com.INVALID>.

Hi Joel,

Excellent suggestions! Thanks much for sharing your real-world experience; this is invaluable information for the project. Any others out there who can share their experience and what would help Drill be even better for their use case?

Arina, how do we want to capture and prioritize the ideas we've gotten from Joel and Weijie? Jira tickets? Some document with high-level themes that links to specific JIRA tickets?

The core development team has done great work, but it can't do everything. How might we identify those ideas that are within the scope of the core team, and those that could use contributions from the community?

Thanks,
- Paul

 

    On Friday, August 17, 2018, 5:24:28 AM PDT, Joel Pfaff <jo...@gmail.com> wrote:  
 
 Hello,

Thanks for reaching out to the community to open this discussion.
I have put a list of suggestions below, but before jumping to them, I think
there is something more important at stake.

To second Paul and Weijie, I think there is a need to define how we see
Drill in the future: where we see its strengths, where we see its
weaknesses, and what to change/improve to make Drill a leader within a
certain scope.

Suggestions (there is no specific priority):

* Metadata management improvement:
 - It has been discussed a lot in this thread and a few other ones, so I am
only going to focus on a few topics:
We currently rely on Spark-based ETL to generate the Parquet files for
Drill. To be able to migrate to Drill's CTAS, we would need a better way to
manage the schemas and types (including the definition of complex types).

 - I have seen multiple cases where we would need a better way to
atomically manage a set of files as a table. As an example: our first
implementation in the application was based on recursively copying a
significant dataset, to have a safe place to remove/add files without
impact on ongoing queries. After this change is done, the newly created
directory is promoted as the new base directory for searches. The old
directory is kept as a way to implement searches on snapshots. Later on,
this was improved to rely only on symlinks, but the implementation
became really complex, for a problem that could be solved more efficiently
in the query engine layer. From that perspective, the Iceberg project is a
very interesting solution.

* SQL expressiveness:
 - We would also appreciate a SQL way to create workspaces (like a CREATE
DATABASE), and progressively the integration of additional
relational/analytical capabilities (INTERSECT/EXCEPT, and ROLLUP/CUBE).
 - I would like to have a means to play a bit with the execution plans
without touching the session options at each query. Support for query
hints could bring some interesting benefits.

* Resource Management:
 - Drill has made nice improvements in the management of local
resources (inside a Drillbit) with regard to CPU/memory, but is still
lacking in resource management at the cluster level.
Due to this limitation, we plan to manage one Drill cluster per functional
group. A mechanism for admission control with priorities would help us
reduce the number of Drill clusters (and have bigger ones).

* Resiliency:
 - When the clusters get large, the probability of losing nodes increases.
It would be helpful to support resuming/restarting queries that were
executing on a node that failed.

* Hosting:
 - While we currently store everything on the Hadoop clusters, we expect
to move some of the historical storage to object stores. And we may have
cases where the functional group handling the data does not need a Hadoop
cluster, but still would like to be able to search through its dataset
stored on the object store. As a longer-term goal, being able to run Drill
on Kubernetes would allow us to give the power of Drill to these users on a
shared infrastructure.

* Drill Extensions:
 - Several query engines support inlining UDFs with the query. In
Spark, the querying DSL allows lambdas to be serialized with the query
(in Java/Scala/Python/R).
In BigQuery, the SQL supports primitives to inline JavaScript UDFs (I found
it crazy at first, but on second thought, I found JS to be an
excellent idea as a lightweight sandbox for untrusted code).
For these kinds of extensions, a migration to Arrow as the internal value
vector representation would provide the basic libraries to exchange data
between the interpreters.
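
The copy-then-promote scheme described under metadata management above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the actual application code: it assumes a POSIX filesystem where renaming a symlink over another is atomic, and `promote`, the `current` link and the `snapshot-N` directory names are all hypothetical.

```python
import os
import shutil
import tempfile


def promote(base_link, new_dir):
    """Repoint base_link at new_dir in one rename (POSIX semantics assumed)."""
    tmp_link = base_link + ".tmp"
    if os.path.lexists(tmp_link):
        os.remove(tmp_link)
    os.symlink(new_dir, tmp_link)
    # os.replace renames the link itself, so a reader resolving base_link
    # sees either the old target or the new one.
    os.replace(tmp_link, base_link)


root = tempfile.mkdtemp()

# First snapshot with one placeholder file (not real Parquet data).
v1 = os.path.join(root, "snapshot-1")
os.mkdir(v1)
with open(os.path.join(v1, "a.parquet"), "w") as f:
    f.write("placeholder")

current = os.path.join(root, "current")
promote(current, v1)

# Build the next snapshot as a copy, change it, then promote it.
v2 = os.path.join(root, "snapshot-2")
shutil.copytree(v1, v2)
with open(os.path.join(v2, "b.parquet"), "w") as f:
    f.write("placeholder")
promote(current, v2)

print(sorted(os.listdir(current)))  # files visible through the promoted link
print(sorted(os.listdir(v1)))       # old snapshot stays queryable
```

Queries that had already resolved the old directory keep reading from it, which is the property the snapshot approach relies on; garbage-collecting old snapshots is left out of this sketch.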

Regards, Joel

On Wed, Aug 15, 2018 at 12:44 AM weijie tong <to...@gmail.com>
wrote:

> My thoughts on this topic: Drill does well now, but to be better we need to
> be idealists and bring in more use cases or more advanced query performance
> compared to other projects like Flink, Spark, Presto and Impala. On
> performance, I wonder whether we need to adopt the Gandiva project, which is
> so exciting, or do our own similar implementation without migrating to
> Arrow. If we choose to adopt it, we should go about migrating to Arrow.
> Arrow works well with other language implementations, which gives more
> options for optimization, as Gandiva does; Python UDFs would also be easy to
> do.
>
> We also need some evangelists to promote the Drill project and attract
> more contributors.
> It is rare to see Drill presented at tech events to expand its community
> influence.
>
> On Wed, Aug 15, 2018 at 4:26 AM Paul Rogers <pa...@yahoo.com.invalid>
> wrote:
>
> > I wonder if we should pop the discussion up a level? What goals should
> > Drill have as an Apache project?
> >
> > Drill is a big data query engine, and shares that description with
> Impala,
> > Presto, Hive and (to some degree) Spark. Drill's adoption is currently
> > lower than Impala or Spark. What unique use cases can Drill address that
> > are unserved (or under-served) by the Impala and Spark juggernauts?
> >
> > To grow, Drill must define its sweet spot: what it does better than any
> > other project. Let's identify why organizations might want to use Drill
> > rather than (or in addition to) the better-known alternatives. Answer
> that,
> > and we enter a virtuous cycle: organizations will adopt Drill because it
> > does things that other tools don't do (or do poorly). Some of those
> > adopters will want to contribute to Drill for new use cases, which will
> > encourage more adoption.
> >
> > Drill is like many other projects in their early years: one core vendor
> > has graciously contributed the bulk of the code. (Impala, Hadoop, Spark,
> > Kudu and Kafka are other examples.) Naturally, the work of the core team
> > focuses on the specific needs of that vendor's customers. Ideally, Drill
> > would, like those other tools, gain sufficient adoption that many other
> > organizations contribute as well, broadening the set of supported use
> > cases, and entering that virtuous growth cycle, to everyone's benefit.
> >
> >
> > The core question: what does the community see as gaps in their big data
> > stacks that Drill can serve?
> >
> > Thanks,
> >
> > - Paul
> >
> >
> >
> >    On Tuesday, August 14, 2018, 7:08:51 AM PDT, Arina Yelchiyeva <
> > arina.yelchiyeva@gmail.com> wrote:
> >
> >  1. Regarding the Drill metastore, it's under investigation; please follow up
> > with DRILL-6552.
> > 2. UDFs: I would not say it's that difficult to write UDFs in Drill.
> > Definitely, it could have been made easier, but even in the current state we
> > have good manuals. Regarding adding support for different languages like
> > Python, that would require a full rewrite of the UDF handling code, since
> > Drill relies heavily on Java source code during UDF initialization.
> > Though generally it's a good idea since Hive, for example, supports Scala
> > and Python for UDFs.
> > 3. Drill vs Arrow is a topic I have heard about since I started working with
> > Drill. But so far nobody has dared to tackle it. I suspect Drill would first
> > have to contribute changes to Arrow to be able to migrate, which could
> > be a show-stopper if the Arrow community does not accept them.
> >
> > On Tue, Aug 14, 2018 at 6:37 AM Charles Givre <cg...@gmail.com> wrote:
> >
> > > I’d like to weigh in here as well. As a long-time user of Drill, I really
> > > would like to see more people using it and I think there are a few key
> > > aspects that could really help on that front.
> > >
> > > The first is the Arrow integration. I’m not enough of a software
> > > engineer to understand all the internal details here, but as I understand
> > > it, the promise of Arrow is that many tools will share a common memory
> > > model and that it will be possible to transfer data from one tool to the
> > > other without having to serialize/deserialize the data. In the data
> > > science community many of the major platforms, Python-pandas, R, and Spark
> > > are moving to or have adopted Arrow.
> > > Drill’s strength is the ease with which it can query many different data
> > > sources and if Drill were to adopt Arrow, I suspect that many people would
> > > adopt it as a part of a machine learning pipeline. Just recently, I
> > > attempted to do some data manipulation using Spark, and couldn’t help but
> > > notice how difficult it was in contrast with Drill. I’m sure this is a very
> > > complex task, but I do think that it could be worth it in the end.
> > >
> > > Secondly, I’d like to second Paul’s call to simplify the interfaces for
> > > UDFs, format and ideally storage plugins. A core strength of Drill is its
> > > extensibility and making it easier would be a great thing. I was wondering
> > > whether it would be possible, or even a good idea, to enable users to write
> > > UDFs in a scripting language such as Python.
> > >
> > > Thirdly, I would really like to see us add more functionality to Drill.
> > > @Arina, your work to build a storage plugin for ElasticSearch is really
> > > great and I think more capabilities like that are really needed. I’d like
> > > to see a generic HTTP storage plugin, a storage plugin for Google Sheets,
> > > and so on. If I can figure out how storage plugins work, I’ll gladly work
> > > on some of these.
> > >
> > > Just my .02.
> > > — C
> > >
> > >
> > >
> > >
> > >
> > > > On Aug 13, 2018, at 21:21, Paul Rogers <pa...@yahoo.com.INVALID>
> > > wrote:
> > > >
> > > > Hi Arina,
> > > >
> > > > Another topic would be whether/how to round out Drill's data model.
> > > Drill's scalar and nullable types are pretty solid. Great work was done
> > > recently for Decimal (though the old types still remain.) Good support
> is
> > > now available for nested types to do implicit joins to produce
> > SQL-friendly
> > > flat records.
> > > > But, opportunities for improvement still remain. Date/Time has timezone
> > > > issues. Union, List and Repeated List never quite worked. There are a few
> > > > types identified in the code, but not implemented (dates with TZ, tiny
> > > > ints, etc.) How should Drill bridge the gap from arrays and maps (really,
> > > > structs) on the one hand, and plain-old-relational ODBC/JDBC/BI tools on
> > > > the other?
> > > >
> > > > Would be good to finalize the data types and their mapping to plain
> > SQL:
> > > either keep a type and make it fully work if it has holes, or drop it.
> > > Unions and Lists are the messiest. They are incomplete in part, because
> > > they are trying to do the impossible: to predict the future well enough
> > > that Drill can handle columns with varying or ambiguous data types
> (that
> > > is, to handle schema changes.) Is there a better way to handle this
> issue
> > > (such as with metadata hints)? That is, rather than fight with
> > conflicting
> > > types at run time, simply declare the common type in metadata so all
> > > operators and record batches agree on the type.
> > > >
> > > > And, of course, there is the lingering issue of Drill vectors vs. Arrow.
> > > > Arrow did great work in metadata, but seems to have kept some of the
> > > > awkward aspects of Drill's original memory model (lack of control over
> > > > batch sizes, ability to fragment memory.) Might there be a resyncing of the
> > > > two projects: Drill picks up Arrow's metadata and APIs, Arrow picks up
> > > > Drill's memory improvements, such as the size-limiting "result set loader"
> > > > framework?
> > > >
> > > > Big-picture issues such as this tend to get lost in the 2270 open
> Jira
> > > tickets. How might the project create some "theme" tickets (or Wiki
> pages
> > > or whatever) to help pull the main issues out of the wealth of detail
> in
> > > Jira?
> > > >
> > > > Thanks,
> > > > - Paul
> > > >
> > > >
> > > >
> > > >    On Monday, August 13, 2018, 11:07:39 AM PDT, Paul Rogers <
> > > par0328@yahoo.com> wrote:
> > > >
> > > > Hi Arina,
> > > >
> > > > Thanks for launching this discussion. A few minor suggestions.
> > > >
> > > > The developers have done a fantastic job stabilizing and improving
> > > > Drill's core functionality. Now the opportunity is to expand the use cases
> > > > for Drill so that it gets wider adoption within the community. Drill
> > > > competes for mindshare with Impala, Presto, Hive, Spark and others. A key
> > > > differentiator for Drill can be the ability to extend the core and
> > > > integrate Drill into user applications. Of these tools, only Spark has a
> > > > fully extensible model. Can Drill provide some of the flexibility that has
> > > > powered Spark to success?
> > > >
> > > > 1. You mentioned the metastore is under active investigation.
> Anything
> > > yet to share? Didn't see any activity on the JIRA ticket. Metadata is a
> > key
> > > gap in Drill. Simply adding a Hive-like metastore would repeat the very
> > > errors that Drill was meant to address. Maybe we can toss around ideas
> > for
> > > a metadata API that provides greater flexibility.
> > > >
> > > > 2. Users can extend the core with custom UDFs, storage engines,
> formats
> > > and so on. At present, the code to do this is rather hard to write,
> debug
> > > and maintain. Is there value in streamlining those interfaces so that a
> > > wider audience can extend Drill for their specific needs?
> > > >
> > > > 3. Similarly, we've seen interest in integrating Drill with other
> > > systems, which suggests an opportunity for improved APIs. Ability to
> > > associate options, defaults and restrictions with users. Ability to use
> > the
> > > REST API for larger data sets and with stateful session options. And so
> > on.
> > > >
> > > > Such extensions are best guided by user demands: what can Drill
> provide
> > > for production applications to enable simpler/faster/more complete
> > > integration?
> > > >
> > > > Thanks,
> > > >
> > > > - Paul
> > > >
> > > >
> > > >
> > > >    On Monday, August 13, 2018, 5:42:08 AM PDT, Arina Ielchiieva <
> > > arina@apache.org> wrote:
> > > >
> > > > Hi all,
> > > >
> > > > as a new PMC Chair I would like to thank users for choosing and using
> > > > Apache Drill and contributors /  committers for making improvements
> and
> > > > fixes. Recently Apache Drill 1.14 was released bundled up with many
> > > > improvements and new features. Please feel free to try it out and
> share
> > > > your experience. As always we would love to hear your success stories
> > of
> > > > using Apache Drill.
> > > >
> > > > Also I encourage users to share any problems found in Drill, as well
> as
> > > any
> > > > suggestions for future improvements. Feel free to start discussion on
> > the
> > > > mailing list and then file a Jira with the summary. Contributions are
> > > > always welcome: minor, major, doc improvements or grammar fixes. Just
> > > file
> > > > a Jira and open the PR. Do not hesitate to ping developers on the
> > mailing
> > > > list if PR is not being timely reviewed.
> > > >
> > > > Latest project reports show:
> > > > Apache Drill project has healthy release schedule, each release
> > includes
> > > > lots of features.
> > > > Mailing list (user / dev) are getting substantial support from the
> > active
> > > > developers, including Stackoverflow and Twitter.
> > > > New committers are added on the steady basis.
> > > >
> > > > Overall project is growing and moving forward. There have been
> > > discussions
> > > > about Drill 2.0 last year and currently Drill metastore feature is
> > under
> > > > active investigation which might the breaking change for 2.0.
> > > >
> > > > Please feel free to reply to this email with your comments /
> concerns /
> > > > ideas about current project state.
> > > >
> > > > Kind regards,
> > > > Arina
> > >
> > >
>

Re: [DISCUSSION] current project state

Posted by Paul Rogers <pa...@yahoo.com.INVALID>.

Hi Joel,

Excellent suggestions! Thanks much for sharing your real-world experience; this is invaluable information for the project. Any others out there who can share their experience and what would help Drill be even before for their use case?

Arina, how do we want to capture and prioritize the ideas we've gotten from Joel and Weijie? Jira tickets? Some document with high-level themes that links to specific JIRA tickets?

The core development team has done great work, but it can't do everything. How might we identify those within the scope of the core team, and those that could use contributions from the community?

Thanks,
- Paul

 

    On Friday, August 17, 2018, 5:24:28 AM PDT, Joel Pfaff <jo...@gmail.com> wrote:  
 
 Hello,

Thanks for reaching the community to open this discussion.
I have put a list of suggestions below, but before jumping to them. I think
there is something more important at stake.

To second Paul and Weijie, I think there is a need to define how we see
Drill in the future. Where do we see its strength, where do we see
weaknesses, and what to change/improve to put Drill as a leader for a
certain scope.

Suggestions (there is no specific priority):

* Metadata management improvement:
 - It has been discussed a lot in this thread and a few other ones. So I am
only going to focus on a few topics:
We currently do rely on Spark based ETL to generate the parquet files for
Drill. To be able to migrate to Drill's CTAS, we would need a better way to
manage the schemas and types (including the definition of complex types).

 - I have seen multiple cases where we would need a better way to manage
atomically a set of files as tables. As an example: our first
implementation in the application was based on recursively copying a
significant dataset, to have a safe place to remove/add files without
impact on ongoing queries. After this change is done, the newly created
directory is promoted as the new base directory for searches. The old
directory is kept as a way to implement searches on snapshots. Later on, an
improvement has been done to only rely on symlinks, but the implementation
became really complex, for a problem that could be solved more efficiently
on the querying engine layer. In that perspective, the Iceberg project is a
very interesting solution.

* SQL expressiveness:
 - We would also appreciate a SQL way to create workspaces (like a CREATE
DATABASE). And progressively the integration of additional
relational/analytical capabilities (INTERSECT/EXCEPT, and ROLLUP/CUBE).
 - I would like to have means to play a bit with the execution plans
without touching the session options at each query. So a support of query
hints could bring some interesting benefits.

* Resource Management:
 - Drill made nice improvements related to the management of local
resources (inside a DrillBit) with regards to CPU/Memory, but is still
lacking on the resource management at the cluster level.
Due to this limitation, we plan to manage one Drill cluster per functional
group. So a mechanism of admission control with priority would help us to
have to reduce the number of Drill clusters (and have bigger ones).

* Resiliency:
 - When the clusters get large, the probability of losing nodes increases.
It would be helpful to support resuming/restarting queries that were
executing on a node that failed.

* Hosting:
 - While we do store everything currently on the Hadoop clusters, we expect
to move some of the historical storage to object stores. And we may have
cases where the functional group handling the data does not need a Hadoop
cluster, but still would like to be able to search through his dataset
stored on the Object Store. As a longer term goal, being able to run Drill
on Kubernetes would allow us to give the Drill power to these users on a
shared infrastucture.

* Drill Extensions:
 - Several querying enging support the inlining of UDF with the query. In
Spark, the Querying DSL allows for Lambdas to be serialized with the query
(in Java/Scala/Python/R).
In BigQuery, the SQL supports primitives to inline JavaScript UDF (I found
it crazy at first, but after a second thought, I found JS to be an
excellent idea as a lightweight sandbox for untrusted code).
For these kinds of extensions, a migration to Arrow as an internal value
vector representation, would provide the basic libraries to exchange data
between the interpreters.

Regards, Joel

On Wed, Aug 15, 2018 at 12:44 AM weijie tong <to...@gmail.com>
wrote:

> My thinking about this topic. Drill does well now. But be better,we need to
> be idealist to bring in more use cases or more advanced query performance
> compared to other projects like Flink , Spark, Presto,Impala. To
> performance, I wonder do we need to adopt the project Gandiva which is so
> exciting or we does our own similar implementation without migrating to
> Arrow. If we choose to adopt it,we should go about migrating to Arrow.
> Arrow is good at with some other language implementations which give more
> options to do optimization like Gandiva does , python UDF is also easy to
> do.
>
> We also need some  evangelists to broadcast the Drill project  to adopt
> more contributors.
> It’s rarely to see Drill’s tech show to expand its community influence.
>
> On Wed, Aug 15, 2018 at 4:26 AM Paul Rogers <pa...@yahoo.com.invalid>
> wrote:
>
> > I wonder if we should pop the discussion up a level? What goals should
> > Drill have as an Apache project?
> >
> > Drill is a big data query engine, and shares that description with
> Impala,
> > Presto, Hive and (to some degree) Spark. Drill's adoption is currently
> > lower than Impala or Spark. What unique use cases can Drill address that
> > are unserved (or under-served) by the Impala and Spark juggernauts?
> >
> > To grow, Drill must define its sweet spot: what it does better than any
> > other project. Let's identify why organizations might want to use Drill
> > rather than (or in addition to) the better-known alternatives. Answer
> that,
> > and we enter a virtuous cycle: organizations will adopt Drill because it
> > does things that other tools don't do (or do poorly). Some of those
> > adopters will want to contribute to Drill for new use cases, which will
> > encourage more adoption.
> >
> > Drill is like many other projects in their early years: one core vendor
> > has graciously contributed the bulk of the code. (Impala, Hadoop, Spark,
> > Kudu and Kafka are other examples.) Naturally, the work of the core team
> > focuses on the specific needs of that vendor's customers. Ideally, Drill
> > would, like those other tools, gain sufficient adoption that many other
> > organizations contribute as well, broadening the set of supported use
> > cases, and entering that virtuous growth cycle, to everyone's benefit.
> >
> >
> > The core question: what does the community see as gaps in their big data
> > stacks that Drill can serve?
> >
> > Thanks,
> >
> > - Paul
> >
> >
> >
> >    On Tuesday, August 14, 2018, 7:08:51 AM PDT, Arina Yelchiyeva <
> > arina.yelchiyeva@gmail.com> wrote:
> >
> >  1. Regarding Drill metastore, it's under investigation, please follow up
> > with DRILL-6552.
> > 2. UDFs: I would not say, it's that quit to write UDFs in Drill.
> > Definitely, it could have been done easier but even for current state we
> > have good manuals. Regarding adding support for different languages like
> > python, that would require full re-write on UDFs code handling, since
> Drill
> > heavily relies on Java source code during UDF initialization.
> Though
> > generally it's a good idea since, Hive, for example, supports Scala,
> Python
> > for UDFs.
> > 3. Drill vs Arrow is the topic I heard since I have started working with
> > Drill. But so far nobody dared to tackle it. I would suspect Drill first
> > would have to contribute changes in Arrow to be able to migrate which
> could
> > be a show-stopper if Arrow community does not accept them.
> >
> > On Tue, Aug 14, 2018 at 6:37 AM Charles Givre <cg...@gmail.com> wrote:
> >
> > > I’d like to weigh in here as well. As a long time user of Drill, I
> really
> > > would like to see more people using it and I think there are a few key
> > > aspects that could really help on that front.
> > >
> > > The first of which is the Arrow integration.  I’m not enough of a
> > software
> > > engineer to understand all the internal details here, but as I
> understand
> > > it, the promise of Arrow is that many tools will share a common memory
> > > model and that it will be possible to transfer data from one tool to
> the
> > > other without having to serialize/deserialize the data.  In the data
> > > science community many of the major platforms, Python-pandas, R, and
> > Spark
> > > are moving or have adopted Arrow.
> > > Drill’s strength is the ease that it can query many different data
> > sources
> > > and if Drill were to adopt Arrow, I suspect that many people would
> adopt
> > it
> > > as a part of a machine learning pipeline.  Just recently, I attempted
> to
> > do
> > > some data manipulation using Spark, and couldn’t help but notice how
> > > difficult it was in contrast with Drill. I’m sure this is a very
> complex
> > > task, but I do think that it could be worth it in the end.
> > >
> > > Secondly, I’d like to second Paul’s call to simplify the interfaces for
> > > UDFs, Format and ideally storage plugins.  A core strength of Drill is
> > its
> > > extensibility and making it easier would be a great thing.  I was
> > wondering
> > > whether it would be possible or even a good idea, to enable users to
> > write
> > > UDFs in a scripting language such as python.
> > >
> > > Thirdly,
> > > i would really like to see us add more functionality to Drill.  @Arina,
> > > your work to build a storage plugin for ElasticSearch is really great
> > and I
> > > think more capabilities like that are really needed.  I’d like to see a
> > > generic HTTP storage plugin, a storage plugin for Google Sheets,  If I
> > can
> > > figure out how storage plugins work, I’ll gladly work on some of these.
> > >
> > > Just my .02.
> > > — C
> > >
> > >
> > >
> > >
> > >
> > > > On Aug 13, 2018, at 21:21, Paul Rogers <pa...@yahoo.com.INVALID>
> > > wrote:
> > > >
> > > > Hi Arina,
> > > >
> > > > Another topic would be whether/how to round out Drill's data model.
> > > Drill's scalar and nullable types are pretty solid. Great work was done
> > > recently for Decimal (though the old types still remain.) Good support
> is
> > > now available for nested types to do implicit joins to produce
> > SQL-friendly
> > > flat records.
> > > > But, opportunities for improvement still remain. Date/Time has
> timezone
> > > issues. Union, List and Repeated List never quite worked. There are a
> few
> > > types identified in the code, but not implemented (dates with TZ, tiny
> > > ints, etc.) How should Drill bridge the gap from arrays and maps
> > (really,
> > > structs) on the one hand, and plain-old-relational ODBC/JDBC/BI tools
> on
> > > the other?
> > > >
> > > > Would be good to finalize the data types and their mapping to plain
> > SQL:
> > > either keep a type and make it fully work if it has holes, or drop it.
> > > Unions and Lists are the messiest. They are incomplete in part, because
> > > they are trying to do the impossible: to predict the future well enough
> > > that Drill can handle columns with varying or ambiguous data types
> (that
> > > is, to handle schema changes.) Is there a better way to handle this
> issue
> > > (such as with metadata hints)? That is, rather than fight with
> > conflicting
> > > types at run time, simply declare the common type in metadata so all
> > > operators and record batches agree on the type.
> > > >
> > > > And, of course, there is the lingering issue of Drill vectors vs.
> > Arrow.
> > > Arrow did great work in metadata, but seems to have kept some of the
> > > awkward aspects of Drill's original memory model (lack of control over
> > > batch sizes, ability to fragment memory.) Might there be a resyncing of
> > the
> > > two projects: Drill picks up Arrow's metadata and APIs, Arrow picks up
> > > Drill's memory improvements, such as the size-limiting "result set
> > loader"
> > > framework.
> > > >
> > > > Big-picture issues such as this tend to get lost in the 2270 open
> Jira
> > > tickets. How might the project create some "theme" tickets (or Wiki
> pages
> > > or whatever) to help pull the main issues out of the wealth of detail
> in
> > > Jira?
> > > >
> > > > Thanks,
> > > > - Paul
> > > >
> > > >
> > > >
> > > >    On Monday, August 13, 2018, 11:07:39 AM PDT, Paul Rogers <
> > > par0328@yahoo.com> wrote:
> > > >
> > > > Hi Arina,
> > > >
> > > > Thanks for launching this discussion. A few minor suggestions.
> > > >
> > > > The developers have done a fantastic job stabilizing and improving
> > > Drill's core functionality. Now the opportunity is to expand the use
> > cases
> > > for Drill so that it gets wider adoption within the community. Drill
> > > competes for mindshare with Impala, Presto, Hive, Spark and others. A
> key
> > > differentiator for Drill can be the ability to extend the core and
> > > integrate Drill into user applications. Of these tools, only Spark has
> a
> > > fully extensible model. Can Drill provide some of the flexibility that
> > has
> > > powered Spark to success?
> > > >
> > > > 1. You mentioned the metastore is under active investigation.
> Anything
> > > yet to share? Didn't see any activity on the JIRA ticket. Metadata is a
> > key
> > > gap in Drill. Simply adding a Hive-like metastore would repeat the very
> > > errors that Drill was meant to address. Maybe we can toss around ideas
> > for
> > > a metadata API that provides greater flexibility.
> > > >
> > > > 2. Users can extend the core with custom UDFs, storage engines,
> formats
> > > and so on. At present, the code to do this is rather hard to write,
> debug
> > > and maintain. Is there value in streamlining those interfaces so that a
> > > wider audience can extend Drill for their specific needs?
> > > >
> > > > 3. Similarly, we've seen interest in integrating Drill with other
> > > systems, which suggests an opportunity for improved APIs. Ability to
> > > associate options, defaults and restrictions with users. Ability to use
> > the
> > > REST API for larger data sets and with stateful session options. And so
> > on.
> > > >
> > > > Such extensions are best guided by user demands: what can Drill
> provide
> > > for production applications to enable simpler/faster/more complete
> > > integration?
> > > >
> > > > Thanks,
> > > >
> > > > - Paul
> > > >
> > > >
> > > >
> > > >    On Monday, August 13, 2018, 5:42:08 AM PDT, Arina Ielchiieva <
> > > arina@apache.org> wrote:
> > > >
> > > > Hi all,
> > > >
> > > > as a new PMC Chair I would like to thank users for choosing and using
> > > > Apache Drill and contributors /  committers for making improvements
> and
> > > > fixes. Recently Apache Drill 1.14 was released bundled up with many
> > > > improvements and new features. Please feel free to try it out and
> share
> > > > your experience. As always we would love to hear your success stories
> > of
> > > > using Apache Drill.
> > > >
> > > > Also I encourage users to share any problems found in Drill, as well
> as
> > > any
> > > > suggestions for future improvements. Feel free to start discussion on
> > the
> > > > mailing list and then file a Jira with the summary. Contributions are
> > > > always welcome: minor, major, doc improvements or grammar fixes. Just
> > > file
> > > > a Jira and open the PR. Do not hesitate to ping developers on the
> > mailing
> > > > list if PR is not being timely reviewed.
> > > >
> > > > Latest project reports show:
> > > > Apache Drill project has healthy release schedule, each release
> > includes
> > > > lots of features.
> > > > Mailing list (user / dev) are getting substantial support from the
> > active
> > > > developers, including Stackoverflow and Twitter.
> > > > New committers are added on the steady basis.
> > > >
> > > > Overall project is growing and moving forward. There have been
> > > discussions
> > > > about Drill 2.0 last year and currently Drill metastore feature is
> > under
> > > > active investigation which might be the breaking change for 2.0.
> > > >
> > > > Please feel free to reply to this email with your comments /
> concerns /
> > > > ideas about current project state.
> > > >
> > > > Kind regards,
> > > > Arina
> > >
> > >
>

Re: [DISCUSSION] current project state

Posted by Kunal Khatua <ku...@apache.org>.

I think apart from a lot of the improvements proposed, like metadata store (+1), having support for a materialized view would be especially useful.

The reason I am proposing this is that there are some storage plugins for which Drill cannot push down filters, etc. Materialized views would let a user pay the cost of querying, say, an S3 source that can be awfully slow, and then do more interactive querying on the materialized view.
The building blocks for this, creating temporary tables, already exist. So it is more of a matter of how to build out a UX that allows easier management of this interim dataset (like TTL, refresh policy, etc.)
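
As a rough illustration of the TTL / refresh-policy bookkeeping mentioned above, here is a minimal Python sketch. It is not an existing Drill feature: the `MaterializedView` class, its fields, and the `run_sql` callback are hypothetical names made up for this example. The generated statements assume Drill's `CREATE TEMPORARY TABLE ... AS` and `DROP TABLE IF EXISTS` syntax; as far as I know temporary tables are scoped to a session, so a real implementation would need longer-lived storage.

```python
import time
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class MaterializedView:
    """Hypothetical bookkeeping record for one interim dataset."""
    name: str
    query: str
    ttl_seconds: float
    refreshed_at: Optional[float] = None  # None means "never built"

    def is_stale(self, now: float) -> bool:
        # Stale if it was never built or its TTL has elapsed.
        return self.refreshed_at is None or now - self.refreshed_at >= self.ttl_seconds

    def refresh_statements(self) -> List[str]:
        # Assumed Drill SQL: drop the old copy, then rebuild it from the slow source.
        return [
            f"DROP TABLE IF EXISTS {self.name}",
            f"CREATE TEMPORARY TABLE {self.name} AS {self.query}",
        ]


def ensure_fresh(view: MaterializedView,
                 run_sql: Callable[[str], object],
                 now: Optional[float] = None) -> bool:
    """Rebuild the view if it is stale. Returns True when a refresh happened."""
    now = time.time() if now is None else now
    if not view.is_stale(now):
        return False
    for statement in view.refresh_statements():
        run_sql(statement)  # run_sql is whatever submits SQL to Drill (hypothetical)
    view.refreshed_at = now
    return True
```

A caller would invoke `ensure_fresh` before each interactive query, so the slow source is only hit once per TTL window.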

While it's great that Drill can query so many sources, it can be a challenge to do interactive SQL with some of these sources due to their non-interactive nature.  

~ Kunal
On 8/17/2018 5:24:27 AM, Joel Pfaff <jo...@gmail.com> wrote:
Hello,

Thanks for reaching out to the community to open this discussion.
I have put a list of suggestions below, but before jumping to them, I think
there is something more important at stake.

To second Paul and Weijie, I think there is a need to define how we see
Drill in the future: where we see its strengths, where we see its
weaknesses, and what to change or improve to make Drill a leader within a
certain scope.

Suggestions (there is no specific priority):

* Metadata management improvement:
- It has been discussed a lot in this thread and a few other ones. So I am
only going to focus on a few topics:
We currently do rely on Spark based ETL to generate the parquet files for
Drill. To be able to migrate to Drill's CTAS, we would need a better way to
manage the schemas and types (including the definition of complex types).

- I have seen multiple cases where we would need a better way to manage
atomically a set of files as tables. As an example: our first
implementation in the application was based on recursively copying a
significant dataset, to have a safe place to remove/add files without
impact on ongoing queries. After this change is done, the newly created
directory is promoted as the new base directory for searches. The old
directory is kept as a way to implement searches on snapshots. Later on, an
improvement has been done to only rely on symlinks, but the implementation
became really complex, for a problem that could be solved more efficiently
on the querying engine layer. In that perspective, the Iceberg project is a
very interesting solution.
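
The symlink-based promotion described above can be sketched in a few lines of Python. This is only an illustration of the general technique (build a new link under a temporary name, then rename it over the live one), not the actual application code. The function name `promote_snapshot` and the `.tmp-` naming are hypothetical, and the sketch assumes a POSIX file system where `base_link` is either absent or already a symlink.

```python
import os
import uuid


def promote_snapshot(base_link: str, new_dir: str) -> None:
    """Atomically repoint the symlink `base_link` at `new_dir` (POSIX)."""
    # Create the new link under a unique temporary name next to the live one...
    tmp_link = f"{base_link}.tmp-{uuid.uuid4().hex}"
    os.symlink(new_dir, tmp_link)
    try:
        # ...then rename it over the live link. The rename replaces the old
        # link in one step, so a reader resolves either the old or the new
        # directory, never a missing path.
        os.replace(tmp_link, base_link)
    except OSError:
        os.unlink(tmp_link)  # do not leave the temporary link behind
        raise
```

The old directory is left untouched, which is what allows it to serve snapshot searches. Queries that resolved the link before the swap keep reading the old directory, and that is exactly the kind of bookkeeping a table format handled by the query engine would take over.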

* SQL expressiveness:
- We would also appreciate a SQL way to create workspaces (like a CREATE
DATABASE). And progressively the integration of additional
relational/analytical capabilities (INTERSECT/EXCEPT, and ROLLUP/CUBE).
- I would like to have means to play a bit with the execution plans
without touching the session options at each query. So a support of query
hints could bring some interesting benefits.
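
For readers less familiar with ROLLUP and CUBE, the short Python sketch below lists the grouping sets each one expands to under standard SQL semantics. It is not Drill code; the function names are made up for illustration.

```python
from itertools import combinations
from typing import List, Sequence, Tuple


def rollup_grouping_sets(cols: Sequence[str]) -> List[Tuple[str, ...]]:
    """ROLLUP(a, b, c) groups by each prefix: (a, b, c), (a, b), (a), ()."""
    return [tuple(cols[:n]) for n in range(len(cols), -1, -1)]


def cube_grouping_sets(cols: Sequence[str]) -> List[Tuple[str, ...]]:
    """CUBE(a, b) groups by every subset: (a, b), (a), (b), ()."""
    return [subset
            for n in range(len(cols), -1, -1)
            for subset in combinations(cols, n)]


print(rollup_grouping_sets(["year", "month"]))
# [('year', 'month'), ('year',), ()]
print(cube_grouping_sets(["year", "month"]))
# [('year', 'month'), ('year',), ('month',), ()]
```

Each grouping set corresponds to one GROUP BY that would otherwise have to be written by hand and combined with UNION ALL.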

* Resource Management:
- Drill made nice improvements related to the management of local
resources (inside a DrillBit) with regards to CPU/Memory, but is still
lacking on the resource management at the cluster level.
Due to this limitation, we plan to manage one Drill cluster per functional
group. A mechanism for admission control with priorities would help us
reduce the number of Drill clusters (and run bigger ones).
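
To make the admission-control idea concrete, here is a minimal Python sketch of a cluster-wide gate that admits queries by priority up to a concurrency limit. It does not describe how Drill works or would implement this; the `AdmissionController` class and its methods are hypothetical, and a real design would also have to account for memory and CPU rather than a simple query count.

```python
import heapq
import itertools
from typing import List


class AdmissionController:
    """Hypothetical gate: at most `max_running` queries run at once."""

    def __init__(self, max_running: int) -> None:
        self.max_running = max_running
        self.running = set()
        self._waiting = []               # heap of (priority, arrival, query_id)
        self._arrival = itertools.count()

    def submit(self, query_id: str, priority: int) -> List[str]:
        """Queue a query (lower number = higher priority); return newly admitted ids."""
        heapq.heappush(self._waiting, (priority, next(self._arrival), query_id))
        return self._admit()

    def finish(self, query_id: str) -> List[str]:
        """Mark a query as done and admit waiting queries into the freed slot."""
        self.running.discard(query_id)
        return self._admit()

    def _admit(self) -> List[str]:
        admitted = []
        while self._waiting and len(self.running) < self.max_running:
            _, _, query_id = heapq.heappop(self._waiting)
            self.running.add(query_id)
            admitted.append(query_id)
        return admitted
```

The arrival counter keeps queries of equal priority in first-come order, so one functional group cannot starve another at the same priority level.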

* Resiliency:
- When the clusters get large, the probability of losing nodes increases.
It would be helpful to support resuming/restarting queries that were
executing on a node that failed.

* Hosting:
- While we do store everything currently on the Hadoop clusters, we expect
to move some of the historical storage to object stores. And we may have
cases where the functional group handling the data does not need a Hadoop
cluster, but still would like to be able to search through its dataset
stored on the object store. As a longer-term goal, being able to run Drill
on Kubernetes would allow us to give the power of Drill to these users on a
shared infrastructure.

* Drill Extensions:
- Several query engines support inlining UDFs with the query. In
Spark, the Querying DSL allows for Lambdas to be serialized with the query
(in Java/Scala/Python/R).
In BigQuery, the SQL supports primitives to inline JavaScript UDF (I found
it crazy at first, but after a second thought, I found JS to be an
excellent idea as a lightweight sandbox for untrusted code).
For these kinds of extensions, a migration to Arrow as an internal value
vector representation, would provide the basic libraries to exchange data
between the interpreters.

Regards, Joel

On Wed, Aug 15, 2018 at 12:44 AM weijie tong
wrote:

> My thoughts on this topic: Drill does well now. But to be better, we need to
> be idealists and bring in more use cases or better query performance
> compared to other projects like Flink, Spark, Presto, and Impala. For
> performance, I wonder whether we need to adopt the Gandiva project, which is so
> exciting, or build our own similar implementation without migrating to
> Arrow. If we choose to adopt it, we should go about migrating to Arrow.
> Arrow works well with other language implementations, which gives more
> options for optimization, as Gandiva does; Python UDFs would also be easy to
> do.
>
> We also need some evangelists to promote the Drill project and attract
> more contributors.
> It’s rare to see Drill presented at tech shows to expand its community influence.
>
> On Wed, Aug 15, 2018 at 4:26 AM Paul Rogers
> wrote:
>
> > I wonder if we should pop the discussion up a level? What goals should
> > Drill have as an Apache project?
> >
> > Drill is a big data query engine, and shares that description with
> Impala,
> > Presto, Hive and (to some degree) Spark. Drill's adoption is currently
> > lower than Impala or Spark. What unique use cases can Drill address that
> > are unserved (or under-served) by the Impala and Spark juggernauts?
> >
> > To grow, Drill must define its sweet spot: what it does better than any
> > other project. Let's identify why organizations might want to use Drill
> > rather than (or in addition to) the better-known alternatives. Answer
> that,
> > and we enter a virtuous cycle: organizations will adopt Drill because it
> > does things that other tools don't do (or do poorly). Some of those
> > adopters will want to contribute to Drill for new use cases, which will
> > encourage more adoption.
> >
> > Drill is like many other projects in their early years: one core vendor
> > has graciously contributed the bulk of the code. (Impala, Hadoop, Spark,
> > Kudu and Kafka are other examples.) Naturally, the work of the core team
> > focuses on the specific needs of that vendor's customers. Ideally, Drill
> > would, like those other tools, gain sufficient adoption that many other
> > organizations contribute as well, broadening the set of supported use
> > cases, and entering that virtuous growth cycle, to everyone's benefit.
> >
> >
> > The core question: what does the community see as gaps in their big data
> > stacks that Drill can serve?
> >
> > Thanks,
> >
> > - Paul
> >
> >
> >
> > On Tuesday, August 14, 2018, 7:08:51 AM PDT, Arina Yelchiyeva <arina.yelchiyeva@gmail.com> wrote:
> >
> > 1. Regarding Drill metastore, it's under investigation, please follow up
> > with DRILL-6552.
> > 2. UDFs: I would not say, it's that quit to write UDFs in Drill.
> > Definitely, it could have been done easier but even for current state we
> > have good manuals. Regarding adding support for different languages like
> > python, that would require full re-write on UDFs code handling, since
> Drill
> > heavily relies on Java source code during UDF initialization.
> Though
> > generally it's a good idea since, Hive, for example, supports Scala,
> Python
> > for UDFs.
> > 3. Drill vs Arrow is the topic I heard since I have started working with
> > Drill. But so far nobody dared to tackle it. I would suspect Drill first
> > would have to contribute changes in Arrow to be able to migrate which
> could
> > be a show-stopper if Arrow community does not accept them.
> >
> > On Tue, Aug 14, 2018 at 6:37 AM Charles Givre wrote:
> >
> > > I’d like to weigh in here as well. As a long time user of Drill, I
> really
> > > would like to see more people using it and I think there are a few key
> > > aspects that could really help on that front.
> > >
> > > The first of which is the Arrow integration. I’m not enough of a
> > software
> > > engineer to understand all the internal details here, but as I
> understand
> > > it, the promise of Arrow is that many tools will share a common memory
> > > model and that it will be possible to transfer data from one tool to
> the
> > > other without having to serialize/deserialize the data. In the data
> > > science community many of the major platforms, Python-pandas, R, and
> > Spark
> > > are moving or have adopted Arrow.
> > > Drill’s strength is the ease that it can query many different data
> > sources
> > > and if Drill were to adopt Arrow, I suspect that many people would
> adopt
> > it
> > > as a part of a machine learning pipeline. Just recently, I attempted
> to
> > do
> > > some data manipulation using Spark, and couldn’t help but notice how
> > > difficult it was in contrast with Drill. I’m sure this is a very
> complex
> > > task, but I do think that it could be worth it in the end.
> > >
> > > Secondly, I’d like to second Paul’s call to simplify the interfaces for
> > > UDFs, Format and ideally storage plugins. A core strength of Drill is
> > its
> > > extensibility and making it easier would be a great thing. I was
> > wondering
> > > whether it would be possible or even a good idea, to enable users to
> > write
> > > UDFs in a scripting language such as python.
> > >
> > > Thirdly,
> > > i would really like to see us add more functionality to Drill. @Arina,
> > > your work to build a storage plugin for ElasticSearch is really great
> > and I
> > > think more capabilities like that are really needed. I’d like to see a
> > > generic HTTP storage plugin, a storage plugin for Google Sheets, If I
> > can
> > > figure out how storage plugins work, I’ll gladly work on some of these.
> > >
> > > Just my .02.
> > > — C
> > >
> > >
> > >
> > >
> > >
> > > > On Aug 13, 2018, at 21:21, Paul Rogers
> > > wrote:
> > > >
> > > > Hi Arina,
> > > >
> > > > Another topic would be whether/how to round out Drill's data model.
> > > Drill's scalar and nullable types are pretty solid. Great work was done
> > > recently for Decimal (though the old types still remain.) Good support
> is
> > > now available for nested types to do implicit joins to produce
> > SQL-friendly
> > > flat records.
> > > > But, opportunities for improvement still remain. Date/Time has
> timezone
> > > issues. Union, List and Repeated List never quite worked. There are a
> few
> > > types identified in the code, but not implemented (dates with TZ, tiny
> > > ints, etc.) How should Drill bridge the gap from arrays and maps
> > (really,
> > > structs) on the one hand, and plain-old-relational ODBC/JDBC/BI tools
> on
> > > the other?
> > > >
> > > > Would be good to finalize the data types and their mapping to plain
> > SQL:
> > > either keep a type and make it fully work if it has holes, or drop it.
> > > Unions and Lists are the messiest. They are incomplete in part, because
> > > they are trying to do the impossible: to predict the future well enough
> > > that Drill can handle columns with varying or ambiguous data types
> (that
> > > is, to handle schema changes.) Is there a better way to handle this
> issue
> > > (such as with metadata hints)? That is, rather than fight with
> > conflicting
> > > types at run time, simply declare the common type in metadata so all
> > > operators and record batches agree on the type.
> > > >
> > > > And, of course, there is the lingering issue of Drill vectors vs.
> > Arrow.
> > > Arrow did great work in metadata, but seems to have kept some of the
> > > awkward aspects of Drill's original memory model (lack of control over
> > > batch sizes, ability to fragment memory.) Might there be a resyncing of
> > the
> > > two projects: Drill picks up Arrow's metadata and APIs, Arrow picks up
> > > Drill's memory improvements, such as the size-limiting "result set
> > loader"
> > > framework.
> > > >
> > > > Big-picture issues such as this tend to get lost in the 2270 open
> Jira
> > > tickets. How might the project create some "theme" tickets (or Wiki
> pages
> > > or whatever) to help pull the main issues out of the wealth of detail
> in
> > > Jira?
> > > >
> > > > Thanks,
> > > > - Paul
> > > >
> > > >
> > > >
> > > > On Monday, August 13, 2018, 11:07:39 AM PDT, Paul Rogers <par0328@yahoo.com> wrote:
> > > >
> > > > Hi Arina,
> > > >
> > > > Thanks for launching this discussion. A few minor suggestions.
> > > >
> > > > The developers have done a fantastic job stabilizing and improving
> > > Drill's core functionality. Now the opportunity is to expand the use
> > cases
> > > for Drill so that it gets wider adoption within the community. Drill
> > > competes for mindshare with Impala, Presto, Hive, Spark and others. A
> key
> > > differentiator for Drill can be the ability to extend the core and
> > > integrate Drill into user applications. Of these tools, only Spark has
> a
> > > fully extensible model. Can Drill provide some of the flexibility that
> > has
> > > powered Spark to success?
> > > >
> > > > 1. You mentioned the metastore is under active investigation.
> Anything
> > > yet to share? Didn't see any activity on the JIRA ticket. Metadata is a
> > key
> > > gap in Drill. Simply adding a Hive-like metastore would repeat the very
> > > errors that Drill was meant to address. Maybe we can toss around ideas
> > for
> > > a metadata API that provides greater flexibility.
> > > >
> > > > 2. Users can extend the core with custom UDFs, storage engines,
> formats
> > > and so on. At present, the code to do this is rather hard to write,
> debug
> > > and maintain. Is there value in streamlining those interfaces so that a
> > > wider audience can extend Drill for their specific needs?
> > > >
> > > > 3. Similarly, we've seen interest in integrating Drill with other
> > > systems, which suggests an opportunity for improved APIs. Ability to
> > > associate options, defaults and restrictions with users. Ability to use
> > the
> > > REST API for larger data sets and with stateful session options. And so
> > on.
> > > >
> > > > Such extensions are best guided by user demands: what can Drill
> provide
> > > for production applications to enable simpler/faster/more complete
> > > integration?
> > > >
> > > > Thanks,
> > > >
> > > > - Paul
> > > >
> > > >
> > > >
> > > > On Monday, August 13, 2018, 5:42:08 AM PDT, Arina Ielchiieva <arina@apache.org> wrote:
> > > >
> > > > Hi all,
> > > >
> > > > as a new PMC Chair I would like to thank users for choosing and using
> > > > Apache Drill and contributors / committers for making improvements
> and
> > > > fixes. Recently Apache Drill 1.14 was released bundled up with many
> > > > improvements and new features. Please feel free to try it out and
> share
> > > > your experience. As always we would love to hear your success stories
> > of
> > > > using Apache Drill.
> > > >
> > > > Also I encourage users to share any problems found in Drill, as well
> as
> > > any
> > > > suggestions for future improvements. Feel free to start discussion on
> > the
> > > > mailing list and then file a Jira with the summary. Contributions are
> > > > always welcome: minor, major, doc improvements or grammar fixes. Just
> > > file
> > > > a Jira and open the PR. Do not hesitate to ping developers on the
> > mailing
> > > > list if PR is not being timely reviewed.
> > > >
> > > > Latest project reports show:
> > > > Apache Drill project has healthy release schedule, each release
> > includes
> > > > lots of features.
> > > > Mailing list (user / dev) are getting substantial support from the
> > active
> > > > developers, including Stackoverflow and Twitter.
> > > > New committers are added on the steady basis.
> > > >
> > > > Overall project is growing and moving forward. There have been
> > > discussions
> > > > about Drill 2.0 last year and currently Drill metastore feature is
> > under
> > > > active investigation which might be the breaking change for 2.0.
> > > >
> > > > Please feel free to reply to this email with your comments /
> concerns /
> > > > ideas about current project state.
> > > >
> > > > Kind regards,
> > > > Arina
> > >
> > >
>

Re: [DISCUSSION] current project state

Posted by Kunal Khatua <ku...@apache.org>.

I think apart from a lot of the improvements proposed, like metadata store (+1), having support for a materialized view would be especially useful.

The reason I am proposing this is that there are some storage plugins for which Drill cannot push down filters, etc. Materialized views would let a user pay the cost of querying, say, an S3 source that can be awfully slow, and then do more interactive querying on the materialized view.
The building blocks for this, creating temporary tables, already exist. So it is more of a matter of how to build out a UX that allows easier management of this interim dataset (like TTL, refresh policy, etc.)

While it's great that Drill can query so many sources, it can be a challenge to do interactive SQL with some of these sources due to their non-interactive nature.  

~ Kunal
On 8/17/2018 5:24:27 AM, Joel Pfaff <jo...@gmail.com> wrote:
Hello,

Thanks for reaching out to the community to open this discussion.
I have put a list of suggestions below, but before jumping to them, I think
there is something more important at stake.

To second Paul and Weijie, I think there is a need to define how we see
Drill in the future: where we see its strengths, where we see its
weaknesses, and what to change or improve to make Drill a leader within a
certain scope.

Suggestions (there is no specific priority):

* Metadata management improvement:
- It has been discussed a lot in this thread and a few other ones. So I am
only going to focus on a few topics:
We currently do rely on Spark based ETL to generate the parquet files for
Drill. To be able to migrate to Drill's CTAS, we would need a better way to
manage the schemas and types (including the definition of complex types).

- I have seen multiple cases where we would need a better way to manage
atomically a set of files as tables. As an example: our first
implementation in the application was based on recursively copying a
significant dataset, to have a safe place to remove/add files without
impact on ongoing queries. After this change is done, the newly created
directory is promoted as the new base directory for searches. The old
directory is kept as a way to implement searches on snapshots. Later on, an
improvement has been done to only rely on symlinks, but the implementation
became really complex, for a problem that could be solved more efficiently
on the querying engine layer. In that perspective, the Iceberg project is a
very interesting solution.

* SQL expressiveness:
- We would also appreciate a SQL way to create workspaces (like a CREATE
DATABASE). And progressively the integration of additional
relational/analytical capabilities (INTERSECT/EXCEPT, and ROLLUP/CUBE).
- I would like to have means to play a bit with the execution plans
without touching the session options at each query. So a support of query
hints could bring some interesting benefits.

* Resource Management:
- Drill made nice improvements related to the management of local
resources (inside a DrillBit) with regards to CPU/Memory, but is still
lacking on the resource management at the cluster level.
Due to this limitation, we plan to manage one Drill cluster per functional
group. A mechanism for admission control with priorities would help us
reduce the number of Drill clusters (and run bigger ones).

* Resiliency:
- When the clusters get large, the probability of losing nodes increases.
It would be helpful to support resuming/restarting queries that were
executing on a node that failed.

* Hosting:
- While we do store everything currently on the Hadoop clusters, we expect
to move some of the historical storage to object stores. And we may have
cases where the functional group handling the data does not need a Hadoop
cluster, but still would like to be able to search through its dataset
stored on the object store. As a longer-term goal, being able to run Drill
on Kubernetes would allow us to give the power of Drill to these users on a
shared infrastructure.

* Drill Extensions:
- Several query engines support inlining UDFs with the query. In
Spark, the Querying DSL allows for Lambdas to be serialized with the query
(in Java/Scala/Python/R).
In BigQuery, the SQL supports primitives to inline JavaScript UDF (I found
it crazy at first, but after a second thought, I found JS to be an
excellent idea as a lightweight sandbox for untrusted code).
For these kinds of extensions, a migration to Arrow as an internal value
vector representation, would provide the basic libraries to exchange data
between the interpreters.

Regards, Joel

On Wed, Aug 15, 2018 at 12:44 AM weijie tong
wrote:

> My thoughts on this topic: Drill does well now. But to be better, we need to
> be idealists and bring in more use cases or better query performance
> compared to other projects like Flink, Spark, Presto, and Impala. For
> performance, I wonder whether we need to adopt the Gandiva project, which is so
> exciting, or build our own similar implementation without migrating to
> Arrow. If we choose to adopt it, we should go about migrating to Arrow.
> Arrow works well with other language implementations, which gives more
> options for optimization, as Gandiva does; Python UDFs would also be easy to
> do.
>
> We also need some evangelists to promote the Drill project and attract
> more contributors.
> It’s rare to see Drill presented at tech shows to expand its community influence.
>
> On Wed, Aug 15, 2018 at 4:26 AM Paul Rogers
> wrote:
>
> > I wonder if we should pop the discussion up a level? What goals should
> > Drill have as an Apache project?
> >
> > Drill is a big data query engine, and shares that description with
> Impala,
> > Presto, Hive and (to some degree) Spark. Drill's adoption is currently
> > lower than Impala or Spark. What unique use cases can Drill address that
> > are unserved (or under-served) by the Impala and Spark juggernauts?
> >
> > To grow, Drill must define its sweet spot: what it does better than any
> > other project. Let's identify why organizations might want to use Drill
> > rather than (or in addition to) the better-known alternatives. Answer
> that,
> > and we enter a virtuous cycle: organizations will adopt Drill because it
> > does things that other tools don't do (or do poorly). Some of those
> > adopters will want to contribute to Drill for new use cases, which will
> > encourage more adoption.
> >
> > Drill is like many other projects in their early years: one core vendor
> > has graciously contributed the bulk of the code. (Impala, Hadoop, Spark,
> > Kudu and Kafka are other examples.) Naturally, the work of the core team
> > focuses on the specific needs of that vendor's customers. Ideally, Drill
> > would, like those other tools, gain sufficient adoption that many other
> > organizations contribute as well, broadening the set of supported use
> > cases, and entering that virtuous growth cycle, to everyone's benefit.
> >
> >
> > The core question: what does the community see as gaps in their big data
> > stacks that Drill can serve?
> >
> > Thanks,
> >
> > - Paul
> >
> >
> >
> > On Tuesday, August 14, 2018, 7:08:51 AM PDT, Arina Yelchiyeva
> > <arina.yelchiyeva@gmail.com> wrote:
> >
> > 1. Regarding the Drill metastore, it's under investigation; please follow
> > DRILL-6552.
> > 2. UDFs: I would not say it's that hard to write UDFs in Drill.
> > Definitely, it could be made easier, but even in the current state we
> > have good manuals. Regarding adding support for different languages like
> > Python, that would require a full rewrite of the UDF handling code, since
> > Drill relies heavily on Java source code during UDF initialization.
> > Though generally it's a good idea since Hive, for example, supports Scala
> > and Python for UDFs.
> > 3. Drill vs. Arrow is a topic I have heard about since I started working
> > with Drill. But so far nobody has dared to tackle it. I suspect Drill
> > would first have to contribute changes to Arrow to be able to migrate,
> > which could be a show-stopper if the Arrow community does not accept them.
> >
> > On Tue, Aug 14, 2018 at 6:37 AM Charles Givre wrote:
> >
> > > I’d like to weigh in here as well. As a long time user of Drill, I
> really
> > > would like to see more people using it and I think there are a few key
> > > aspects that could really help on that front.
> > >
> > > The first of which is the Arrow integration. I’m not enough of a software
> > > engineer to understand all the internal details here, but as I understand
> > > it, the promise of Arrow is that many tools will share a common memory
> > > model and that it will be possible to transfer data from one tool to the
> > > other without having to serialize/deserialize the data. In the data
> > > science community many of the major platforms (Python-pandas, R, and
> > > Spark) are moving to or have adopted Arrow.
> > > Drill’s strength is the ease with which it can query many different data
> > > sources, and if Drill were to adopt Arrow, I suspect that many people
> > > would adopt it as a part of a machine learning pipeline. Just recently, I
> > > attempted to do some data manipulation using Spark, and couldn’t help but
> > > notice how difficult it was in contrast with Drill. I’m sure this is a
> > > very complex task, but I do think that it could be worth it in the end.
> > >
> > > Secondly, I’d like to second Paul’s call to simplify the interfaces for
> > > UDFs, format plugins and ideally storage plugins. A core strength of
> > > Drill is its extensibility, and making it easier would be a great thing.
> > > I was wondering whether it would be possible, or even a good idea, to
> > > enable users to write UDFs in a scripting language such as Python.
> > >
> > > Thirdly, I would really like to see us add more functionality to Drill.
> > > @Arina, your work to build a storage plugin for ElasticSearch is really
> > > great and I think more capabilities like that are really needed. I’d like
> > > to see a generic HTTP storage plugin and a storage plugin for Google
> > > Sheets. If I can figure out how storage plugins work, I’ll gladly work on
> > > some of these.
> > >
> > > Just my .02.
> > > — C
> > >
> > >
> > >
> > >
> > >
> > > > On Aug 13, 2018, at 21:21, Paul Rogers
> > > wrote:
> > > >
> > > > Hi Arina,
> > > >
> > > > > Another topic would be whether/how to round out Drill's data model.
> > > > > Drill's scalar and nullable types are pretty solid. Great work was
> > > > > done recently for Decimal (though the old types still remain). Good
> > > > > support is now available for nested types to do implicit joins to
> > > > > produce SQL-friendly flat records.
> > > > > But opportunities for improvement still remain. Date/Time has
> > > > > timezone issues. Union, List and Repeated List never quite worked.
> > > > > There are a few types identified in the code, but not implemented
> > > > > (dates with TZ, tiny ints, etc.). How should Drill bridge the gap
> > > > > from arrays and maps (really, structs) on the one hand, and
> > > > > plain-old-relational ODBC/JDBC/BI tools on the other?
> > > >
> > > > > Would be good to finalize the data types and their mapping to plain
> > > > > SQL: either keep a type and make it fully work if it has holes, or
> > > > > drop it. Unions and Lists are the messiest. They are incomplete, in
> > > > > part because they are trying to do the impossible: to predict the
> > > > > future well enough that Drill can handle columns with varying or
> > > > > ambiguous data types (that is, to handle schema changes). Is there a
> > > > > better way to handle this issue (such as with metadata hints)? That
> > > > > is, rather than fight with conflicting types at run time, simply
> > > > > declare the common type in metadata so all operators and record
> > > > > batches agree on the type.
> > > >
> > > > > And, of course, there is the lingering issue of Drill vectors vs.
> > > > > Arrow. Arrow did great work in metadata, but seems to have kept some
> > > > > of the awkward aspects of Drill's original memory model (lack of
> > > > > control over batch sizes, ability to fragment memory). Might there be
> > > > > a resyncing of the two projects: Drill picks up Arrow's metadata and
> > > > > APIs, Arrow picks up Drill's memory improvements, such as the
> > > > > size-limiting "result set loader" framework?
> > > >
> > > > Big-picture issues such as this tend to get lost in the 2270 open
> Jira
> > > tickets. How might the project create some "theme" tickets (or Wiki
> pages
> > > or whatever) to help pull the main issues out of the wealth of detail
> in
> > > Jira?
> > > >
> > > > Thanks,
> > > > - Paul
> > > >
> > > >
> > > >
> > > > > On Monday, August 13, 2018, 11:07:39 AM PDT, Paul Rogers
> > > > > <par0328@yahoo.com> wrote:
> > > >
> > > > Hi Arina,
> > > >
> > > > Thanks for launching this discussion. A few minor suggestions.
> > > >
> > > > > The developers have done a fantastic job stabilizing and improving
> > > > > Drill's core functionality. Now the opportunity is to expand the use
> > > > > cases for Drill so that it gets wider adoption within the community.
> > > > > Drill competes for mindshare with Impala, Presto, Hive, Spark and
> > > > > others. A key differentiator for Drill can be the ability to extend
> > > > > the core and integrate Drill into user applications. Of these tools,
> > > > > only Spark has a fully extensible model. Can Drill provide some of
> > > > > the flexibility that has powered Spark to success?
> > > >
> > > > 1. You mentioned the metastore is under active investigation.
> Anything
> > > yet to share? Didn't see any activity on the JIRA ticket. Metadata is a
> > key
> > > gap in Drill. Simply adding a Hive-like metastore would repeat the very
> > > errors that Drill was meant to address. Maybe we can toss around ideas
> > for
> > > a metadata API that provides greater flexibility.
> > > >
> > > > 2. Users can extend the core with custom UDFs, storage engines,
> formats
> > > and so on. At present, the code to do this is rather hard to write,
> debug
> > > and maintain. Is there value in streamlining those interfaces so that a
> > > wider audience can extend Drill for their specific needs?
> > > >
> > > > 3. Similarly, we've seen interest in integrating Drill with other
> > > systems, which suggests an opportunity for improved APIs. Ability to
> > > associate options, defaults and restrictions with users. Ability to use
> > the
> > > REST API for larger data sets and with stateful session options. And so
> > on.
> > > >
> > > > Such extensions are best guided by user demands: what can Drill
> provide
> > > for production applications to enable simpler/faster/more complete
> > > integration?
> > > >
> > > > Thanks,
> > > >
> > > > - Paul
> > > >
> > > >
> > > >
> > > > > On Monday, August 13, 2018, 5:42:08 AM PDT, Arina Ielchiieva
> > > > > <arina@apache.org> wrote:
> > > >
> > > > Hi all,
> > > >
> > > > as a new PMC Chair I would like to thank users for choosing and using
> > > > Apache Drill and contributors / committers for making improvements
> and
> > > > fixes. Recently Apache Drill 1.14 was released bundled up with many
> > > > improvements and new features. Please feel free to try it out and
> share
> > > > your experience. As always we would love to hear your success stories
> > of
> > > > using Apache Drill.
> > > >
> > > > Also I encourage users to share any problems found in Drill, as well
> as
> > > any
> > > > suggestions for future improvements. Feel free to start discussion on
> > the
> > > > mailing list and then file a Jira with the summary. Contributions are
> > > > always welcome: minor, major, doc improvements or grammar fixes. Just
> > > file
> > > > a Jira and open the PR. Do not hesitate to ping developers on the
> > mailing
> > > > list if PR is not being timely reviewed.
> > > >
> > > > Latest project reports show:
> > > > Apache Drill project has healthy release schedule, each release
> > includes
> > > > lots of features.
> > > > Mailing list (user / dev) are getting substantial support from the
> > active
> > > > developers, including Stackoverflow and Twitter.
> > > > New committers are added on the steady basis.
> > > >
> > > > > Overall project is growing and moving forward. There have been
> > > > > discussions about Drill 2.0 last year and currently Drill metastore
> > > > > feature is under active investigation which might be the breaking
> > > > > change for 2.0.
> > > >
> > > > Please feel free to reply to this email with your comments /
> concerns /
> > > > ideas about current project state.
> > > >
> > > > Kind regards,
> > > > Arina
> > >
> > >
>

Re: [DISCUSSION] current project state

Posted by Joel Pfaff <jo...@gmail.com>.

Hello,

Thanks for reaching out to the community to open this discussion.
I have put a list of suggestions below, but before jumping to them, I think
there is something more important at stake.

To second Paul and Weijie, I think there is a need to define how we see
Drill in the future. Where do we see its strengths, where do we see its
weaknesses, and what should we change or improve to make Drill a leader
within a certain scope?

Suggestions (there is no specific priority):

* Metadata management improvement:
 - It has been discussed a lot in this thread and a few other ones, so I am
only going to focus on a few topics:
We currently rely on Spark-based ETL to generate the Parquet files for
Drill. To be able to migrate to Drill's CTAS, we would need a better way to
manage the schemas and types (including the definition of complex types).

 - I have seen multiple cases where we would need a better way to
atomically manage a set of files as a table. As an example: our first
implementation in the application was based on recursively copying a
significant dataset, to have a safe place to remove/add files without
impact on ongoing queries. After this change is done, the newly created
directory is promoted as the new base directory for searches. The old
directory is kept as a way to implement searches on snapshots. Later on, an
improvement was made to rely only on symlinks, but the implementation
became really complex, for a problem that could be solved more efficiently
in the query engine layer. From that perspective, the Iceberg project is a
very interesting solution.

* SQL expressiveness:
 - We would also appreciate a SQL way to create workspaces (like a CREATE
DATABASE), and progressively the integration of additional
relational/analytical capabilities (INTERSECT/EXCEPT, and ROLLUP/CUBE).
 - I would like to have a means to play a bit with the execution plans
without touching the session options at each query. Support for query
hints could bring some interesting benefits.

* Resource Management:
 - Drill made nice improvements related to the management of local
resources (inside a Drillbit) with regard to CPU/memory, but is still
lacking in resource management at the cluster level.
Due to this limitation, we plan to manage one Drill cluster per functional
group. A mechanism of admission control with priority would help us
reduce the number of Drill clusters (and have bigger ones).

* Resiliency:
 - When the clusters get large, the probability of losing nodes increases.
It would be helpful to support resuming/restarting queries that were
executing on a node that failed.

* Hosting:
 - While we currently store everything on the Hadoop clusters, we expect
to move some of the historical storage to object stores. And we may have
cases where the functional group handling the data does not need a Hadoop
cluster, but still would like to be able to search through its dataset
stored on the object store. As a longer-term goal, being able to run Drill
on Kubernetes would allow us to give the power of Drill to these users on a
shared infrastructure.

* Drill Extensions:
 - Several query engines support inlining UDFs with the query. In
Spark, the querying DSL allows lambdas to be serialized with the query
(in Java/Scala/Python/R).
In BigQuery, the SQL supports primitives to inline JavaScript UDFs (I found
it crazy at first, but on second thought, I found JS to be an
excellent idea as a lightweight sandbox for untrusted code).
For these kinds of extensions, a migration to Arrow as the internal value
vector representation would provide the basic libraries to exchange data
between the interpreters.

Regards, Joel

On Wed, Aug 15, 2018 at 12:44 AM weijie tong <to...@gmail.com>
wrote:

> My thinking about this topic: Drill does well now. But to be better, we need
> to be idealists and bring in more use cases or more advanced query
> performance compared to other projects like Flink, Spark, Presto, and Impala.
> On performance, I wonder whether we need to adopt the Gandiva project, which
> is so exciting, or do our own similar implementation without migrating to
> Arrow. If we choose to adopt it, we should go about migrating to Arrow.
> Arrow also has implementations in other languages, which give more
> options for optimization, as Gandiva does; Python UDFs would also be easy
> to do.
>
> We also need some evangelists to promote the Drill project and attract
> more contributors.
> It is rare to see Drill presented at tech events to expand its community
> influence.
>
> On Wed, Aug 15, 2018 at 4:26 AM Paul Rogers <pa...@yahoo.com.invalid>
> wrote:
>
> > I wonder if we should pop the discussion up a level? What goals should
> > Drill have as an Apache project?
> >
> > Drill is a big data query engine, and shares that description with
> Impala,
> > Presto, Hive and (to some degree) Spark. Drill's adoption is currently
> > lower than Impala or Spark. What unique use cases can Drill address that
> > are unserved (or under-served) by the Impala and Spark juggernauts?
> >
> > To grow, Drill must define its sweet spot: what it does better than any
> > other project. Let's identify why organizations might want to use Drill
> > rather than (or in addition to) the better-known alternatives. Answer that,
> > and we enter a virtuous cycle: organizations will adopt Drill because it
> > does things that other tools don't do (or do poorly). Some of those
> > adopters will want to contribute to Drill for new use cases, which will
> > encourage more adoption.
> >
> > Drill is like many other projects in their early years: one core vendor
> > has graciously contributed the bulk of the code. (Impala, Hadoop, Spark,
> > Kudu and Kafka are other examples.) Naturally, the work of the core team
> > focuses on the specific needs of that vendor's customers. Ideally, Drill
> > would, like those other tools, gain sufficient adoption that many other
> > organizations contribute as well, broadening the set of supported use
> > cases, and entering that virtuous growth cycle, to everyone's benefit.
> >
> >
> > The core question: what does the community see as gaps in their big data
> > stacks that Drill can serve?
> >
> > Thanks,
> >
> > - Paul
> >
> >
> >
> >     On Tuesday, August 14, 2018, 7:08:51 AM PDT, Arina Yelchiyeva <
> > arina.yelchiyeva@gmail.com> wrote:
> >
> >  1. Regarding the Drill metastore, it's under investigation; please follow up
> > with DRILL-6552.
> > 2. UDFs: I would not say it's that hard to write UDFs in Drill.
> > Definitely, it could have been made easier, but even in the current state we
> > have good manuals. Regarding adding support for different languages like
> > Python, that would require a full rewrite of the UDF code handling, since
> > Drill heavily relies on Java source code during UDF initialization. Though
> > generally it's a good idea since Hive, for example, supports Scala and
> > Python for UDFs.
> > 3. Drill vs Arrow is a topic I have heard about since I started working with
> > Drill. But so far nobody has dared to tackle it. I suspect Drill would first
> > have to contribute changes to Arrow to be able to migrate, which could
> > be a show-stopper if the Arrow community does not accept them.
> >
> > On Tue, Aug 14, 2018 at 6:37 AM Charles Givre <cg...@gmail.com> wrote:
> >
> > > I’d like to weigh in here as well. As a long time user of Drill, I
> really
> > > would like to see more people using it and I think there are a few key
> > > aspects that could really help on that front.
> > >
> > > The first is Arrow integration.  I’m not enough of a software
> > > engineer to understand all the internal details here, but as I understand
> > > it, the promise of Arrow is that many tools will share a common memory
> > > model and that it will be possible to transfer data from one tool to the
> > > other without having to serialize/deserialize the data.  In the data
> > > science community many of the major platforms, Python-pandas, R, and Spark,
> > > are moving to or have adopted Arrow.
> > > Drill’s strength is the ease with which it can query many different data
> > > sources, and if Drill were to adopt Arrow, I suspect that many people would
> > > adopt it as a part of a machine learning pipeline.  Just recently, I
> > > attempted to do some data manipulation using Spark, and couldn’t help but
> > > notice how difficult it was in contrast with Drill. I’m sure this is a very
> > > complex task, but I do think that it could be worth it in the end.
> > >
> > > Secondly, I’d like to second Paul’s call to simplify the interfaces for
> > > UDFs, format plugins and, ideally, storage plugins.  A core strength of
> > > Drill is its extensibility, and making that easier would be a great thing.
> > > I was wondering whether it would be possible, or even a good idea, to
> > > enable users to write UDFs in a scripting language such as Python.
> > >
> > > Thirdly, I would really like to see us add more functionality to Drill.
> > > @Arina, your work to build a storage plugin for ElasticSearch is really
> > > great and I think more capabilities like that are really needed.  I’d like
> > > to see a generic HTTP storage plugin and a storage plugin for Google Sheets.
> > > If I can figure out how storage plugins work, I’ll gladly work on some of
> > > these.
> > >
> > > Just my .02.
> > > — C
> > >
> > >
> > >
> > >
> > >
> > > > On Aug 13, 2018, at 21:21, Paul Rogers <pa...@yahoo.com.INVALID>
> > > wrote:
> > > >
> > > > Hi Arina,
> > > >
> > > > Another topic would be whether/how to round out Drill's data model.
> > > Drill's scalar and nullable types are pretty solid. Great work was done
> > > recently for Decimal (though the old types still remain.) Good support
> is
> > > now available for nested types to do implicit joins to produce
> > SQL-friendly
> > > flat records.
> > > > > But, opportunities for improvement still remain. Date/Time has timezone
> > > > issues. Union, List and Repeated List never quite worked. There are a few
> > > > types identified in the code, but not implemented (dates with TZ, tiny
> > > > ints, etc.) How should Drill bridge the gap from arrays and maps (really,
> > > > structs) on the one hand, and plain-old-relational ODBC/JDBC/BI tools on
> > > > the other?
> > > >
> > > > Would be good to finalize the data types and their mapping to plain
> > SQL:
> > > either keep a type and make it fully work if it has holes, or drop it.
> > > Unions and Lists are the messiest. They are incomplete in part, because
> > > they are trying to do the impossible: to predict the future well enough
> > > that Drill can handle columns with varying or ambiguous data types
> (that
> > > is, to handle schema changes.) Is there a better way to handle this
> issue
> > > (such as with metadata hints)? That is, rather than fight with
> > conflicting
> > > types at run time, simply declare the common type in metadata so all
> > > operators and record batches agree on the type.
> > > >
> > > > > And, of course, there is the lingering issue of Drill vectors vs. Arrow.
> > > > Arrow did great work in metadata, but seems to have kept some of the
> > > > awkward aspects of Drill's original memory model (lack of control over
> > > > batch sizes, ability to fragment memory.) Might there be a resyncing of the
> > > > two projects: Drill picks up Arrow's metadata and APIs, Arrow picks up
> > > > Drill's memory improvements, such as the size-limiting "result set loader"
> > > > framework?
> > > >
> > > > Big-picture issues such as this tend to get lost in the 2270 open
> Jira
> > > tickets. How might the project create some "theme" tickets (or Wiki
> pages
> > > or whatever) to help pull the main issues out of the wealth of detail
> in
> > > Jira?
> > > >
> > > > Thanks,
> > > > - Paul
> > > >
> > > >
> > > >
> > > >    On Monday, August 13, 2018, 11:07:39 AM PDT, Paul Rogers <
> > > par0328@yahoo.com> wrote:
> > > >
> > > > Hi Arina,
> > > >
> > > > Thanks for launching this discussion. A few minor suggestions.
> > > >
> > > > > The developers have done a fantastic job stabilizing and improving
> > > > Drill's core functionality. Now the opportunity is to expand the use cases
> > > > for Drill so that it gets wider adoption within the community. Drill
> > > > competes for mindshare with Impala, Presto, Hive, Spark and others. A key
> > > > differentiator for Drill can be the ability to extend the core and
> > > > integrate Drill into user applications. Of these tools, only Spark has a
> > > > fully extensible model. Can Drill provide some of the flexibility that has
> > > > powered Spark to success?
> > > >
> > > > 1. You mentioned the metastore is under active investigation.
> Anything
> > > yet to share? Didn't see any activity on the JIRA ticket. Metadata is a
> > key
> > > gap in Drill. Simply adding a Hive-like metastore would repeat the very
> > > errors that Drill was meant to address. Maybe we can toss around ideas
> > for
> > > a metadata API that provides greater flexibility.
> > > >
> > > > 2. Users can extend the core with custom UDFs, storage engines,
> formats
> > > and so on. At present, the code to do this is rather hard to write,
> debug
> > > and maintain. Is there value in streamlining those interfaces so that a
> > > wider audience can extend Drill for their specific needs?
> > > >
> > > > 3. Similarly, we've seen interest in integrating Drill with other
> > > systems, which suggests an opportunity for improved APIs. Ability to
> > > associate options, defaults and restrictions with users. Ability to use
> > the
> > > REST API for larger data sets and with stateful session options. And so
> > on.
> > > >
> > > > Such extensions are best guided by user demands: what can Drill
> provide
> > > for production applications to enable simpler/faster/more complete
> > > integration?
> > > >
> > > > Thanks,
> > > >
> > > > - Paul
> > > >
> > > >
> > > >
> > > >    On Monday, August 13, 2018, 5:42:08 AM PDT, Arina Ielchiieva <
> > > arina@apache.org> wrote:
> > > >
> > > > Hi all,
> > > >
> > > > as a new PMC Chair I would like to thank users for choosing and using
> > > > Apache Drill and contributors /  committers for making improvements
> and
> > > > fixes. Recently Apache Drill 1.14 was released bundled up with many
> > > > improvements and new features. Please feel free to try it out and
> share
> > > > your experience. As always we would love to hear your success stories
> > of
> > > > using Apache Drill.
> > > >
> > > > Also I encourage users to share any problems found in Drill, as well
> as
> > > any
> > > > suggestions for future improvements. Feel free to start discussion on
> > the
> > > > mailing list and then file a Jira with the summary. Contributions are
> > > > always welcome: minor, major, doc improvements or grammar fixes. Just
> > > file
> > > > a Jira and open the PR. Do not hesitate to ping developers on the
> > mailing
> > > > list if PR is not being timely reviewed.
> > > >
> > > > Latest project reports show:
> > > > Apache Drill project has healthy release schedule, each release
> > includes
> > > > lots of features.
> > > > Mailing list (user / dev) are getting substantial support from the
> > active
> > > > developers, including Stackoverflow and Twitter.
> > > > New committers are added on the steady basis.
> > > >
> > > > > Overall project is growing and moving forward. There have been discussions
> > > > > about Drill 2.0 last year and currently Drill metastore feature is under
> > > > > active investigation which might be the breaking change for 2.0.
> > > >
> > > > Please feel free to reply to this email with your comments /
> concerns /
> > > > ideas about current project state.
> > > >
> > > > Kind regards,
> > > > Arina
> > >
> > >
>

Re: [DISCUSSION] current project state

Posted by weijie tong <to...@gmail.com>.

My thoughts on this topic: Drill does well now, but to do better, we need to
be idealists and bring in more use cases or more advanced query performance
compared to other projects like Flink, Spark, Presto, and Impala. For
performance, I wonder whether we need to adopt the Gandiva project, which is so
exciting, or do our own similar implementation without migrating to
Arrow. If we choose to adopt it, we should go about migrating to Arrow.
Arrow works well with other language implementations, which gives more
options for optimization, as Gandiva does; Python UDFs would also be easy to
do.

We also need some evangelists to promote the Drill project and attract
more contributors.
It’s rare to see Drill technical presentations that expand its community influence.

On Wed, Aug 15, 2018 at 4:26 AM Paul Rogers <pa...@yahoo.com.invalid>
wrote:

> I wonder if we should pop the discussion up a level? What goals should
> Drill have as an Apache project?
>
> Drill is a big data query engine, and shares that description with Impala,
> Presto, Hive and (to some degree) Spark. Drill's adoption is currently
> lower than Impala or Spark. What unique use cases can Drill address that
> are unserved (or under-served) by the Impala and Spark juggernauts?
>
> To grow, Drill must define its sweet spot: what it does better than any
> other project. Let's identify why organizations might want to use Drill
> rather than (or in addition to) the better-known alternatives. Answer that,
> and we enter a virtuous cycle: organizations will adopt Drill because it
> does things that other tools don't do (or do poorly). Some of those
> adopters will want to contribute to Drill for new use cases, which will
> encourage more adoption.
>
> Drill is like many other projects in their early years: one core vendor
> has graciously contributed the bulk of the code. (Impala, Hadoop, Spark,
> Kudu and Kafka are other examples.) Naturally, the work of the core team
> focuses on the specific needs of that vendor's customers. Ideally, Drill
> would, like those other tools, gain sufficient adoption that many other
> organizations contribute as well, broadening the set of supported use
> cases, and entering that virtuous growth cycle, to everyone's benefit.
>
>
> The core question: what does the community see as gaps in their big data
> stacks that Drill can serve?
>
> Thanks,
>
> - Paul
>
>
>
>     On Tuesday, August 14, 2018, 7:08:51 AM PDT, Arina Yelchiyeva <
> arina.yelchiyeva@gmail.com> wrote:
>
>  1. Regarding the Drill metastore, it's under investigation; please follow up
> with DRILL-6552.
> 2. UDFs: I would not say it's that hard to write UDFs in Drill.
> Definitely, it could have been made easier, but even in the current state we
> have good manuals. Regarding adding support for different languages like
> Python, that would require a full rewrite of the UDF code handling, since Drill
> heavily relies on Java source code during UDF initialization. Though
> generally it's a good idea since Hive, for example, supports Scala and Python
> for UDFs.
> 3. Drill vs Arrow is a topic I have heard about since I started working with
> Drill. But so far nobody has dared to tackle it. I suspect Drill would first
> have to contribute changes to Arrow to be able to migrate, which could
> be a show-stopper if the Arrow community does not accept them.
>
> On Tue, Aug 14, 2018 at 6:37 AM Charles Givre <cg...@gmail.com> wrote:
>
> > I’d like to weigh in here as well. As a long time user of Drill, I really
> > would like to see more people using it and I think there are a few key
> > aspects that could really help on that front.
> >
> > The first is Arrow integration.  I’m not enough of a software
> > engineer to understand all the internal details here, but as I understand
> > it, the promise of Arrow is that many tools will share a common memory
> > model and that it will be possible to transfer data from one tool to the
> > other without having to serialize/deserialize the data.  In the data
> > science community many of the major platforms, Python-pandas, R, and Spark,
> > are moving to or have adopted Arrow.
> > Drill’s strength is the ease with which it can query many different data
> > sources, and if Drill were to adopt Arrow, I suspect that many people would
> > adopt it as a part of a machine learning pipeline.  Just recently, I
> > attempted to do some data manipulation using Spark, and couldn’t help but
> > notice how difficult it was in contrast with Drill. I’m sure this is a very
> > complex task, but I do think that it could be worth it in the end.
> >
> > Secondly, I’d like to second Paul’s call to simplify the interfaces for
> > UDFs, format plugins and, ideally, storage plugins.  A core strength of
> > Drill is its extensibility, and making that easier would be a great thing.
> > I was wondering whether it would be possible, or even a good idea, to
> > enable users to write UDFs in a scripting language such as Python.
> >
> > Thirdly, I would really like to see us add more functionality to Drill.
> > @Arina, your work to build a storage plugin for ElasticSearch is really
> > great and I think more capabilities like that are really needed.  I’d like
> > to see a generic HTTP storage plugin and a storage plugin for Google Sheets.
> > If I can figure out how storage plugins work, I’ll gladly work on some of
> > these.
> >
> > Just my .02.
> > — C
> >
> >
> >
> >
> >
> > > On Aug 13, 2018, at 21:21, Paul Rogers <pa...@yahoo.com.INVALID>
> > wrote:
> > >
> > > Hi Arina,
> > >
> > > Another topic would be whether/how to round out Drill's data model.
> > Drill's scalar and nullable types are pretty solid. Great work was done
> > recently for Decimal (though the old types still remain.) Good support is
> > now available for nested types to do implicit joins to produce
> SQL-friendly
> > flat records.
> > > But, opportunities for improvement still remain. Date/Time has timezone
> > issues. Union, List and Repeated List never quite worked. There are a few
> > types identified in the code, but not implemented (dates with TZ, tiny
> > ints, etc.) How should Drill bridge the gap from arrays and maps (really,
> > structs) on the one hand, and plain-old-relational ODBC/JDBC/BI tools on
> > the other?
> > >
> > > Would be good to finalize the data types and their mapping to plain
> SQL:
> > either keep a type and make it fully work if it has holes, or drop it.
> > Unions and Lists are the messiest. They are incomplete in part, because
> > they are trying to do the impossible: to predict the future well enough
> > that Drill can handle columns with varying or ambiguous data types (that
> > is, to handle schema changes.) Is there a better way to handle this issue
> > (such as with metadata hints)? That is, rather than fight with
> conflicting
> > types at run time, simply declare the common type in metadata so all
> > operators and record batches agree on the type.
> > >
> > > And, of course, there is the lingering issue of Drill vectors vs. Arrow.
> > Arrow did great work in metadata, but seems to have kept some of the
> > awkward aspects of Drill's original memory model (lack of control over
> > batch sizes, ability to fragment memory.) Might there be a resyncing of the
> > two projects: Drill picks up Arrow's metadata and APIs, Arrow picks up
> > Drill's memory improvements, such as the size-limiting "result set loader"
> > framework?
> > >
> > > Big-picture issues such as this tend to get lost in the 2270 open Jira
> > tickets. How might the project create some "theme" tickets (or Wiki pages
> > or whatever) to help pull the main issues out of the wealth of detail in
> > Jira?
> > >
> > > Thanks,
> > > - Paul
> > >
> > >
> > >
> > >    On Monday, August 13, 2018, 11:07:39 AM PDT, Paul Rogers <
> > par0328@yahoo.com> wrote:
> > >
> > > Hi Arina,
> > >
> > > Thanks for launching this discussion. A few minor suggestions.
> > >
> > > The developers have done a fantastic job stabilizing and improving
> > Drill's core functionality. Now the opportunity is to expand the use cases
> > for Drill so that it gets wider adoption within the community. Drill
> > competes for mindshare with Impala, Presto, Hive, Spark and others. A key
> > differentiator for Drill can be the ability to extend the core and
> > integrate Drill into user applications. Of these tools, only Spark has a
> > fully extensible model. Can Drill provide some of the flexibility that has
> > powered Spark to success?
> > >
> > > 1. You mentioned the metastore is under active investigation. Anything
> > yet to share? Didn't see any activity on the JIRA ticket. Metadata is a
> key
> > gap in Drill. Simply adding a Hive-like metastore would repeat the very
> > errors that Drill was meant to address. Maybe we can toss around ideas
> for
> > a metadata API that provides greater flexibility.
> > >
> > > 2. Users can extend the core with custom UDFs, storage engines, formats
> > and so on. At present, the code to do this is rather hard to write, debug
> > and maintain. Is there value in streamlining those interfaces so that a
> > wider audience can extend Drill for their specific needs?
> > >
> > > 3. Similarly, we've seen interest in integrating Drill with other
> > systems, which suggests an opportunity for improved APIs. Ability to
> > associate options, defaults and restrictions with users. Ability to use
> the
> > REST API for larger data sets and with stateful session options. And so
> on.
> > >
> > > Such extensions are best guided by user demands: what can Drill provide
> > for production applications to enable simpler/faster/more complete
> > integration?
> > >
> > > Thanks,
> > >
> > > - Paul
> > >
> > >
> > >
> > >    On Monday, August 13, 2018, 5:42:08 AM PDT, Arina Ielchiieva <
> > arina@apache.org> wrote:
> > >
> > > Hi all,
> > >
> > > as a new PMC Chair I would like to thank users for choosing and using
> > > Apache Drill and contributors /  committers for making improvements and
> > > fixes. Recently Apache Drill 1.14 was released bundled up with many
> > > improvements and new features. Please feel free to try it out and share
> > > your experience. As always we would love to hear your success stories
> of
> > > using Apache Drill.
> > >
> > > Also I encourage users to share any problems found in Drill, as well as
> > any
> > > suggestions for future improvements. Feel free to start discussion on
> the
> > > mailing list and then file a Jira with the summary. Contributions are
> > > always welcome: minor, major, doc improvements or grammar fixes. Just
> > file
> > > a Jira and open the PR. Do not hesitate to ping developers on the
> mailing
> > > list if PR is not being timely reviewed.
> > >
> > > Latest project reports show:
> > > Apache Drill project has healthy release schedule, each release
> includes
> > > lots of features.
> > > Mailing list (user / dev) are getting substantial support from the
> active
> > > developers, including Stackoverflow and Twitter.
> > > New committers are added on the steady basis.
> > >
> > > Overall project is growing and moving forward. There have been discussions
> > > about Drill 2.0 last year and currently Drill metastore feature is under
> > > active investigation which might be the breaking change for 2.0.
> > >
> > > Please feel free to reply to this email with your comments / concerns /
> > > ideas about current project state.
> > >
> > > Kind regards,
> > > Arina
> >
> >

Re: [DISCUSSION] current project state

Posted by weijie tong <to...@gmail.com>.

My thinking about this topic. Drill does well now. But be better,we need to
be idealist to bring in more use cases or more advanced query performance
compared to other projects like Flink , Spark, Presto,Impala. To
performance, I wonder do we need to adopt the project Gandiva which is so
exciting or we does our own similar implementation without migrating to
Arrow. If we choose to adopt it,we should go about migrating to Arrow.
Arrow is good at with some other language implementations which give more
options to do optimization like Gandiva does , python UDF is also easy to
do.

We also need some  evangelists to broadcast the Drill project  to adopt
more contributors.
It’s rarely to see Drill’s tech show to expand its community influence.

On Wed, Aug 15, 2018 at 4:26 AM Paul Rogers <pa...@yahoo.com.invalid>
wrote:

> I wonder if we should pop the discussion up a level? What goals should
> Drill have as an Apache project?
>
> Drill is a big data query engine, and shares that description with Impala,
> Presto, Hive and (to some degree) Spark. Drill's adoption is currently
> lower than Impala or Spark. What unique use cases can Drill address that
> are unserved (or under-served) by the Impala and Spark juggernauts?
>
> To grow, Drill must define its sweet spot: what it does better than any
> other project. Let's identify why organizations might want to use Drill
> rather than (or in addition to) the better-known alternatives. Answer that,
> and we enter a virtuous cycle: organizations will adopt Drill because it
> does things that other tools don't do (or do poorly). Some of those
> adopters will want to contribute to Drill for new use cases, which will
> encourage more adoption.
>
> Drill is like many other projects in their early years: one core vendor
> has graciously contributed the bulk of the code. (Impala, Hadoop, Spark,
> Kudu and Kafka are other examples.) Naturally, the work of the core team
> focuses on the specific needs of that vendor's customers. Ideally, Drill
> would, like those other tools, gain sufficient adoption that many other
> organizations contribute as well, broadening the set of supported use
> cases, and entering that virtuous growth cycle, to everyone's benefit.
>
>
> The core question: what does the community see as gaps in their big data
> stacks that Drill can serve?
>
> Thanks,
>
> - Paul
>
>
>
>     On Tuesday, August 14, 2018, 7:08:51 AM PDT, Arina Yelchiyeva <
> arina.yelchiyeva@gmail.com> wrote:
>
>  1. Regarding Drill metastore, its under investigation, please follow up
> with DRILL-6552.
> 2. UDFs: I would not say, it's that quit to write UDFs in Drill.
> Definitely, it could have been done easier but even for current state we
> have good manuals. Regarding adding support for different languages like
> python, that would require full re-write on UDFs code handling, since Drill
> heavily relies on Java source code when during UDFs initialization. Though
> generally it's a good idea since, Hive, for example, supports Scala, Python
> for UDFs.
> 3. Drill vs Arrow is the topic I heard since I have started working with
> Drill. But so far nobody dared to tackle it. I would suspect Drill first
> would have to contribute changes in Arrow to be able to migrate which could
> be a show-stopper if Arrow community does not accept them.
>
> On Tue, Aug 14, 2018 at 6:37 AM Charles Givre <cg...@gmail.com> wrote:
>
> > I’d like to weigh in here as well. As a long time user of Drill, I really
> > would like to see more people using it and I think there are a few key
> > aspects that could really help on that front.
> >
> > The first of which is the Arrow integration.  I’m not enough of a
> software
> > engineer to understand all the internal details here, but as I understand
> > it, the promise of Arrow is that many tools will share a common memory
> > model and that it will be possible to transfer data from one tool to the
> > other without having to serialize/deserialize the data.  In the data
> > science community many of the major platforms, Python-pandas, R, and
> Spark
> > are moving or have adopted Arrow.
> > Drill’s strength is the ease that it can query many different data
> sources
> > and if Drill were to adopt Arrow, I suspect that many people would adopt
> it
> > as a part of a machine learning pipeline.  Just recently, I attempted to
> do
> > some data manipulation using Spark, and couldn’t help but notice how
> > difficult ti was in contrast with Drill. I’m sure this is a very complex
> > task, but I do think that it could be worth it in the end.
> >
> > Secondly, I’d like to second Paul’s call to simplify the interfaces for
> > UDFs, Format and ideally storage plugins.  A core strength of Drill is
> its
> > extensibility and making it easier would be a great thing.  I was
> wondering
> > whether it would be possible or even a good idea, to enable users to
> write
> > UDFs in a scripting language such as python.
> >
> > Thirdly,
> > i would really like to see us add more functionality to Drill.  @Arina,
> > your work to build a storage plugin for ElasticSearch is really great
> and I
> > think more capabilities like that are really needed.  I’d like to see a
> > generic HTTP storage plugin, a storage plugin for Google Sheets,  If I
> can
> > figure out how storage plugins work, I’ll gladly work on some of these.
> >
> > Just my .02.
> > — C
> >
> >
> >
> >
> >
> > > On Aug 13, 2018, at 21:21, Paul Rogers <pa...@yahoo.com.INVALID>
> > wrote:
> > >
> > > Hi Arina,
> > >
> > > Another topic would be whether/how to round out Drill's data model.
> > Drill's scalar and nullable types are pretty solid. Great work was done
> > recently for Decimal (though the old types still remain.) Good support is
> > now available for nested types to do implicit joins to produce
> SQL-friendly
> > flat records.
> > > But, opportunities for improvement still remain. Date/Time has timezone
> > issues. Union, List and Repeated List never quite worked. There are a few
> > types identified in the code, but not implemented (dates with TZ, tiny
> > ints, etc.) How should Drill bridge. the gap from arrays and maps
> (really,
> > structs) on the one hand, and plain-old-relational ODBC/JDBC/BI tools on
> > the other?
> > >
> > > Would be good to finalize the data types and their mapping to plain
> SQL:
> > either keep a type and make it fully work if it has holes, or drop it.
> > Unions and Lists are the messiest. They are incomplete in part, because
> > they are trying to do the impossible: to predict the future well enough
> > that Drill can handle columns with varying or ambiguous data types (that
> > is, to handle schema changes.) Is there a better way to handle this issue
> > (such as with metadata hints)? That is, rather than fight with
> conflicting
> > types at run time, simply declare the common type in metadata so all
> > operators and record batches agree on the type.
> > >
> > > And, of course, there is the lingering issue of Drill vectors vs.
> Arrow.
> > Arrow did great work in metadata, but seems to have kept some of the
> > awkward aspects of Drill's original memory model (lack of control over
> > batch sizes, ability to fragment memory.) Might there be a resyncing of
> the
> > two projects: Drill picks up Arrow's metadata and APIs, Arrow picks up
> > Drill's memory improvements, such as the size-limiting "result set
> loader"
> > framework.
> > >
> > > Big-picture issues such as this tend to get lost in the 2270 open Jira
> > tickets. How might the project create some "theme" tickets (or Wiki pages
> > or whatever) to help pull the main issues out of the wealth of detail in
> > Jira?
> > >
> > > Thanks,
> > > - Paul
> > >
> > >
> > >
> > >    On Monday, August 13, 2018, 11:07:39 AM PDT, Paul Rogers <
> > par0328@yahoo.com> wrote:
> > >
> > > Hi Arina,
> > >
> > > Thanks for launching this discussion. A few minor suggestions.
> > >
> > > The developers have done a fantastic job stabilizing and improving
> > Drill's core functionality. Now the opportunity is to expand the use
> cases
> > for Drill so that it gets wider adoption within the community. Drill
> > competes for mindshare with Impala, Presto, Hive, Spark and others. A key
> > differentiator for Drill can be the ability to extend the core and
> > integrate Drill into user applications. Of these tools, only Spark has a
> > fully ostensible model. Can Drill provide some of the flexibility that
> has
> > powered Spark to success?
> > >
> > > 1. You mentioned the metastore is under active investigation. Anything
> > yet to share? Didn't see any activity on the JIRA ticket. Metadata is a
> key
> > gap in Drill. Simply adding a Hive-like metastore would repeat the very
> > errors that Drill was meant to address. Maybe we can toss around ideas
> for
> > a metadata API that provides greater flexibility.
> > >
> > > 2. Users can extend the core with custom UDFs, storage engines, formats
> > and so on. At present, the code to do this is rather hard to write, debug
> > and maintain. Is there value in streamlining those interfaces so that a
> > wider audience can extend Drill for their specific needs?
> > >
> > > 3. Similarly, we've seen interest in integrating Drill with other
> > systems, which suggests an opportunity for improved APIs. Ability to
> > associate options, defaults and restrictions with users. Ability to use
> the
> > REST API for larger data sets and with stateful session options. And so
> on.
> > >
> > > Such extensions are best guided by user demands: what can Drill provide
> > for production applications to enable simpler/faster/more complete
> > integration?
> > >
> > > Thanks,
> > >
> > > - Paul
> > >
> > >
> > >
> > >    On Monday, August 13, 2018, 5:42:08 AM PDT, Arina Ielchiieva <
> > arina@apache.org> wrote:
> > >
> > > Hi all,
> > >
> > > as a new PMC Chair I would like to thank users for choosing and using
> > > Apache Drill and contributors /  committers for making improvements and
> > > fixes. Recently Apache Drill 1.14 was released bundled up with many
> > > improvements and new features. Please feel free to try it out and share
> > > your experience. As always we would love to hear your success stories
> of
> > > using Apache Drill.
> > >
> > > Also I encourage users to share any problems found in Drill, as well as
> > any
> > > suggestions for future improvements. Feel free to start discussion on
> the
> > > mailing list and then file a Jira with the summary. Contributions are
> > > always welcome: minor, major, doc improvements or grammar fixes. Just
> > file
> > > a Jira and open the PR. Do not hesitate to ping developers on the
> mailing
> > > list if PR is not being timely reviewed.
> > >
> > > Latest project reports show:
> > > Apache Drill project has healthy release schedule, each release
> includes
> > > lots of features.
> > > Mailing list (user / dev) are getting substantial support from the
> active
> > > developers, including Stackoverflow and Twitter.
> > > New committers are added on the steady basis.
> > >
> > > Overall project is growing and moving forward. There have been
> > discussions
> > > about Drill 2.0 last year and currently Drill metastore feature is
> under
> > > active investigation which might be the breaking change for 2.0.
> > >
> > > Please feel free to reply to this email with your comments / concerns /
> > > ideas about current project state.
> > >
> > > Kind regards,
> > > Arina
> >
> >

Re: [DISCUSSION] current project state

Posted by Paul Rogers <pa...@yahoo.com.INVALID>.

I wonder if we should pop the discussion up a level? What goals should Drill have as an Apache project?

Drill is a big data query engine, and shares that description with Impala, Presto, Hive and (to some degree) Spark. Drill's adoption is currently lower than Impala or Spark. What unique use cases can Drill address that are unserved (or under-served) by the Impala and Spark juggernauts?

To grow, Drill must define its sweet spot: what it does better than any other project. Let's identify why organizations might want to use Drill rather than (or in addition to) the better-known alternatives. Answer that, and we enter a virtuous cycle: organizations will adopt Drill because it does things that other tools don't do (or do poorly). Some of those adopters will want to contribute to Drill for new use cases, which will encourage more adoption.

Drill is like many other projects in their early years: one core vendor has graciously contributed the bulk of the code. (Impala, Hadoop, Spark, Kudu and Kafka are other examples.) Naturally, the work of the core team focuses on the specific needs of that vendor's customers. Ideally, Drill would, like those other tools, gain sufficient adoption that many other organizations contribute as well, broadening the set of supported use cases, and entering that virtuous growth cycle, to everyone's benefit.


The core question: what does the community see as gaps in their big data stacks that Drill can serve?

Thanks,

- Paul

 

    On Tuesday, August 14, 2018, 7:08:51 AM PDT, Arina Yelchiyeva <ar...@gmail.com> wrote:  
 
1. Regarding the Drill metastore: it is under investigation; please follow
DRILL-6552.
2. UDFs: I would not say it is that hard to write UDFs in Drill. It could
certainly be made easier, but even in the current state we have good
manuals. Adding support for other languages such as Python would require a
full rewrite of the UDF handling code, since Drill relies heavily on Java
source code during UDF initialization. Still, it is generally a good idea
since Hive, for example, supports Scala and Python for UDFs.
3. Drill vs. Arrow is a topic I have heard about since I started working
with Drill, but so far nobody has dared to tackle it. I suspect Drill would
first have to contribute changes to Arrow to be able to migrate, which could
be a show-stopper if the Arrow community does not accept them.

On Tue, Aug 14, 2018 at 6:37 AM Charles Givre <cg...@gmail.com> wrote:

> I’d like to weigh in here as well. As a long time user of Drill, I really
> would like to see more people using it and I think there are a few key
> aspects that could really help on that front.
>
> The first of which is the Arrow integration.  I’m not enough of a software
> engineer to understand all the internal details here, but as I understand
> it, the promise of Arrow is that many tools will share a common memory
> model and that it will be possible to transfer data from one tool to the
> other without having to serialize/deserialize the data.  In the data
> science community many of the major platforms, Python-pandas, R, and Spark,
> are moving to or have adopted Arrow.
> Drill’s strength is the ease with which it can query many different data
> sources, and if Drill were to adopt Arrow, I suspect that many people would
> adopt it as part of a machine learning pipeline.  Just recently, I attempted
> to do some data manipulation using Spark, and couldn’t help but notice how
> difficult it was in contrast with Drill. I’m sure this is a very complex
> task, but I do think that it could be worth it in the end.
>
> Secondly, I’d like to second Paul’s call to simplify the interfaces for
> UDFs, Format and ideally storage plugins.  A core strength of Drill is its
> extensibility and making it easier would be a great thing.  I was wondering
> whether it would be possible or even a good idea, to enable users to write
> UDFs in a scripting language such as python.
>
> Thirdly, I would really like to see us add more functionality to Drill.
> @Arina, your work to build a storage plugin for ElasticSearch is really
> great and I think more capabilities like that are really needed.  I’d like
> to see a generic HTTP storage plugin and a storage plugin for Google Sheets.
> If I can figure out how storage plugins work, I’ll gladly work on some of
> these.
>
> Just my .02.
> — C
>
>
>
>
>
> > On Aug 13, 2018, at 21:21, Paul Rogers <pa...@yahoo.com.INVALID>
> wrote:
> >
> > Hi Arina,
> >
> > Another topic would be whether/how to round out Drill's data model.
> Drill's scalar and nullable types are pretty solid. Great work was done
> recently for Decimal (though the old types still remain.) Good support is
> now available for nested types to do implicit joins to produce SQL-friendly
> flat records.
> > But, opportunities for improvement still remain. Date/Time has timezone
> issues. Union, List and Repeated List never quite worked. There are a few
> types identified in the code, but not implemented (dates with TZ, tiny
> ints, etc.) How should Drill bridge the gap from arrays and maps (really,
> structs) on the one hand, and plain-old-relational ODBC/JDBC/BI tools on
> the other?
> >
> > Would be good to finalize the data types and their mapping to plain SQL:
> either keep a type and make it fully work if it has holes, or drop it.
> Unions and Lists are the messiest. They are incomplete in part, because
> they are trying to do the impossible: to predict the future well enough
> that Drill can handle columns with varying or ambiguous data types (that
> is, to handle schema changes.) Is there a better way to handle this issue
> (such as with metadata hints)? That is, rather than fight with conflicting
> types at run time, simply declare the common type in metadata so all
> operators and record batches agree on the type.
> >
> > And, of course, there is the lingering issue of Drill vectors vs. Arrow.
> Arrow did great work in metadata, but seems to have kept some of the
> awkward aspects of Drill's original memory model (lack of control over
> batch sizes, ability to fragment memory.) Might there be a resyncing of the
> two projects: Drill picks up Arrow's metadata and APIs, Arrow picks up
> Drill's memory improvements, such as the size-limiting "result set loader"
> framework?
> >
> > Big-picture issues such as this tend to get lost in the 2270 open Jira
> tickets. How might the project create some "theme" tickets (or Wiki pages
> or whatever) to help pull the main issues out of the wealth of detail in
> Jira?
> >
> > Thanks,
> > - Paul
> >
> >
> >
> >    On Monday, August 13, 2018, 11:07:39 AM PDT, Paul Rogers <
> par0328@yahoo.com> wrote:
> >
> > Hi Arina,
> >
> > Thanks for launching this discussion. A few minor suggestions.
> >
> > The developers have done a fantastic job stabilizing and improving
> Drill's core functionality. Now the opportunity is to expand the use cases
> for Drill so that it gets wider adoption within the community. Drill
> competes for mindshare with Impala, Presto, Hive, Spark and others. A key
> differentiator for Drill can be the ability to extend the core and
> integrate Drill into user applications. Of these tools, only Spark has a
> fully extensible model. Can Drill provide some of the flexibility that has
> powered Spark to success?
> >
> > 1. You mentioned the metastore is under active investigation. Anything
> yet to share? Didn't see any activity on the JIRA ticket. Metadata is a key
> gap in Drill. Simply adding a Hive-like metastore would repeat the very
> errors that Drill was meant to address. Maybe we can toss around ideas for
> a metadata API that provides greater flexibility.
> >
> > 2. Users can extend the core with custom UDFs, storage engines, formats
> and so on. At present, the code to do this is rather hard to write, debug
> and maintain. Is there value in streamlining those interfaces so that a
> wider audience can extend Drill for their specific needs?
> >
> > 3. Similarly, we've seen interest in integrating Drill with other
> systems, which suggests an opportunity for improved APIs. Ability to
> associate options, defaults and restrictions with users. Ability to use the
> REST API for larger data sets and with stateful session options. And so on.
> >
> > Such extensions are best guided by user demands: what can Drill provide
> for production applications to enable simpler/faster/more complete
> integration?
> >
> > Thanks,
> >
> > - Paul
> >
> >
> >
> >    On Monday, August 13, 2018, 5:42:08 AM PDT, Arina Ielchiieva <
> arina@apache.org> wrote:
> >
> > Hi all,
> >
> > as a new PMC Chair I would like to thank users for choosing and using
> > Apache Drill and contributors /  committers for making improvements and
> > fixes. Recently Apache Drill 1.14 was released bundled up with many
> > improvements and new features. Please feel free to try it out and share
> > your experience. As always we would love to hear your success stories of
> > using Apache Drill.
> >
> > Also I encourage users to share any problems found in Drill, as well as
> any
> > suggestions for future improvements. Feel free to start discussion on the
> > mailing list and then file a Jira with the summary. Contributions are
> > always welcome: minor, major, doc improvements or grammar fixes. Just
> file
> > a Jira and open the PR. Do not hesitate to ping developers on the mailing
> > list if PR is not being timely reviewed.
> >
> > Latest project reports show:
> > Apache Drill project has healthy release schedule, each release includes
> > lots of features.
> > Mailing list (user / dev) are getting substantial support from the active
> > developers, including Stackoverflow and Twitter.
> > New committers are added on the steady basis.
> >
> > Overall project is growing and moving forward. There have been
> discussions
> > about Drill 2.0 last year and currently Drill metastore feature is under
> > active investigation which might be the breaking change for 2.0.
> >
> > Please feel free to reply to this email with your comments / concerns /
> > ideas about current project state.
> >
> > Kind regards,
> > Arina
>
>

Re: [DISCUSSION] current project state

Posted by Arina Yelchiyeva <ar...@gmail.com>.

1. Regarding the Drill metastore: it is under investigation; please follow
DRILL-6552.
2. UDFs: I would not say it is that hard to write UDFs in Drill. It could
certainly be made easier, but even in the current state we have good
manuals. Adding support for other languages such as Python would require a
full rewrite of the UDF handling code, since Drill relies heavily on Java
source code during UDF initialization. Still, it is generally a good idea
since Hive, for example, supports Scala and Python for UDFs.
3. Drill vs. Arrow is a topic I have heard about since I started working
with Drill, but so far nobody has dared to tackle it. I suspect Drill would
first have to contribute changes to Arrow to be able to migrate, which could
be a show-stopper if the Arrow community does not accept them.
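
To make point 2 concrete for readers who have not written a UDF, here is a
minimal sketch of the holder-based setup()/eval() shape that simple Drill
functions follow. It is not Drill code: IntHolder and SimpleFunc below are
local stand-ins (assumptions for illustration only) for Drill's holder
classes and function interface, so that the example compiles without Drill
on the classpath. A real UDF would also carry Drill's function annotations
and would be driven by the engine rather than by the hypothetical
applyToRows helper.

```java
public class UdfShapeSketch {

    /** Stand-in for a Drill value holder: a mutable box with a public value field. */
    static final class IntHolder {
        int value;
    }

    /** Stand-in for the two-method contract of a simple (row-by-row) function. */
    interface SimpleFunc {
        void setup(); // called once, before any rows are processed
        void eval();  // called once per row
    }

    /** Adds two ints, written in the holder-based style. */
    static final class AddInts implements SimpleFunc {
        IntHolder left = new IntHolder();  // input (a parameter field in a real UDF)
        IntHolder right = new IntHolder(); // input
        IntHolder out = new IntHolder();   // result (an output field in a real UDF)

        @Override
        public void setup() {
            // Nothing to initialize for this function.
        }

        @Override
        public void eval() {
            out.value = left.value + right.value;
        }
    }

    /** Hypothetical driver: fills the holders and calls eval() once per row. */
    static int[] applyToRows(int[] lefts, int[] rights) {
        AddInts fn = new AddInts();
        fn.setup();
        int[] results = new int[lefts.length];
        for (int i = 0; i < lefts.length; i++) {
            fn.left.value = lefts[i];
            fn.right.value = rights[i];
            fn.eval();
            results[i] = fn.out.value;
        }
        return results;
    }

    public static void main(String[] args) {
        int[] sums = applyToRows(new int[] {1, 2, 3}, new int[] {10, 20, 30});
        System.out.println(java.util.Arrays.toString(sums)); // prints [11, 22, 33]
    }
}
```

The point of the sketch is only the shape: the function is an ordinary Java
class whose fields and method bodies the engine works with directly, which
is why supporting a language such as Python would likely mean replacing that
mechanism rather than extending it.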

On Tue, Aug 14, 2018 at 6:37 AM Charles Givre <cg...@gmail.com> wrote:

> I’d like to weigh in here as well. As a long time user of Drill, I really
> would like to see more people using it and I think there are a few key
> aspects that could really help on that front.
>
> The first of which is the Arrow integration.  I’m not enough of a software
> engineer to understand all the internal details here, but as I understand
> it, the promise of Arrow is that many tools will share a common memory
> model and that it will be possible to transfer data from one tool to the
> other without having to serialize/deserialize the data.  In the data
> science community many of the major platforms, Python-pandas, R, and Spark,
> are moving to or have adopted Arrow.
> Drill’s strength is the ease with which it can query many different data
> sources, and if Drill were to adopt Arrow, I suspect that many people would
> adopt it as part of a machine learning pipeline.  Just recently, I attempted
> to do some data manipulation using Spark, and couldn’t help but notice how
> difficult it was in contrast with Drill. I’m sure this is a very complex
> task, but I do think that it could be worth it in the end.
>
> Secondly, I’d like to second Paul’s call to simplify the interfaces for
> UDFs, Format and ideally storage plugins.  A core strength of Drill is its
> extensibility and making it easier would be a great thing.  I was wondering
> whether it would be possible or even a good idea, to enable users to write
> UDFs in a scripting language such as python.
>
> Thirdly, I would really like to see us add more functionality to Drill.
> @Arina, your work to build a storage plugin for ElasticSearch is really
> great and I think more capabilities like that are really needed.  I’d like
> to see a generic HTTP storage plugin and a storage plugin for Google Sheets.
> If I can figure out how storage plugins work, I’ll gladly work on some of
> these.
>
> Just my .02.
> — C
>
>
>
>
>
> > On Aug 13, 2018, at 21:21, Paul Rogers <pa...@yahoo.com.INVALID>
> wrote:
> >
> > Hi Arina,
> >
> > Another topic would be whether/how to round out Drill's data model.
> Drill's scalar and nullable types are pretty solid. Great work was done
> recently for Decimal (though the old types still remain.) Good support is
> now available for nested types to do implicit joins to produce SQL-friendly
> flat records.
> > But, opportunities for improvement still remain. Date/Time has timezone
> issues. Union, List and Repeated List never quite worked. There are a few
> types identified in the code, but not implemented (dates with TZ, tiny
> ints, etc.) How should Drill bridge the gap from arrays and maps (really,
> structs) on the one hand, and plain-old-relational ODBC/JDBC/BI tools on
> the other?
> >
> > Would be good to finalize the data types and their mapping to plain SQL:
> either keep a type and make it fully work if it has holes, or drop it.
> Unions and Lists are the messiest. They are incomplete in part, because
> they are trying to do the impossible: to predict the future well enough
> that Drill can handle columns with varying or ambiguous data types (that
> is, to handle schema changes.) Is there a better way to handle this issue
> (such as with metadata hints)? That is, rather than fight with conflicting
> types at run time, simply declare the common type in metadata so all
> operators and record batches agree on the type.
> >
> > And, of course, there is the lingering issue of Drill vectors vs. Arrow.
> Arrow did great work in metadata, but seems to have kept some of the
> awkward aspects of Drill's original memory model (lack of control over
> batch sizes, ability to fragment memory.) Might there be a resyncing of the
> two projects: Drill picks up Arrow's metadata and APIs, Arrow picks up
> Drill's memory improvements, such as the size-limiting "result set loader"
> framework?
> >
> > Big-picture issues such as this tend to get lost in the 2270 open Jira
> tickets. How might the project create some "theme" tickets (or Wiki pages
> or whatever) to help pull the main issues out of the wealth of detail in
> Jira?
> >
> > Thanks,
> > - Paul
> >
> >
> >
> >    On Monday, August 13, 2018, 11:07:39 AM PDT, Paul Rogers <
> par0328@yahoo.com> wrote:
> >
> > Hi Arina,
> >
> > Thanks for launching this discussion. A few minor suggestions.
> >
> > The developers have done a fantastic job stabilizing and improving
> Drill's core functionality. Now the opportunity is to expand the use cases
> for Drill so that it gets wider adoption within the community. Drill
> competes for mindshare with Impala, Presto, Hive, Spark and others. A key
> differentiator for Drill can be the ability to extend the core and
> integrate Drill into user applications. Of these tools, only Spark has a
> fully extensible model. Can Drill provide some of the flexibility that has
> powered Spark to success?
> >
> > 1. You mentioned the metastore is under active investigation. Anything
> yet to share? Didn't see any activity on the JIRA ticket. Metadata is a key
> gap in Drill. Simply adding a Hive-like metastore would repeat the very
> errors that Drill was meant to address. Maybe we can toss around ideas for
> a metadata API that provides greater flexibility.
> >
> > 2. Users can extend the core with custom UDFs, storage engines, formats
> and so on. At present, the code to do this is rather hard to write, debug
> and maintain. Is there value in streamlining those interfaces so that a
> wider audience can extend Drill for their specific needs?
> >
> > 3. Similarly, we've seen interest in integrating Drill with other
> systems, which suggests an opportunity for improved APIs. Ability to
> associate options, defaults and restrictions with users. Ability to use the
> REST API for larger data sets and with stateful session options. And so on.
> >
> > Such extensions are best guided by user demands: what can Drill provide
> for production applications to enable simpler/faster/more complete
> integration?
> >
> > Thanks,
> >
> > - Paul
> >
> >
> >
> >    On Monday, August 13, 2018, 5:42:08 AM PDT, Arina Ielchiieva <
> arina@apache.org> wrote:
> >
> > Hi all,
> >
> > as a new PMC Chair I would like to thank users for choosing and using
> > Apache Drill and contributors /  committers for making improvements and
> > fixes. Recently Apache Drill 1.14 was released bundled up with many
> > improvements and new features. Please feel free to try it out and share
> > your experience. As always we would love to hear your success stories of
> > using Apache Drill.
> >
> > Also I encourage users to share any problems found in Drill, as well as
> any
> > suggestions for future improvements. Feel free to start discussion on the
> > mailing list and then file a Jira with the summary. Contributions are
> > always welcome: minor, major, doc improvements or grammar fixes. Just
> file
> > a Jira and open the PR. Do not hesitate to ping developers on the mailing
> > list if PR is not being timely reviewed.
> >
> > Latest project reports show:
> > Apache Drill project has healthy release schedule, each release includes
> > lots of features.
> > Mailing list (user / dev) are getting substantial support from the active
> > developers, including Stackoverflow and Twitter.
> > New committers are added on the steady basis.
> >
> > Overall project is growing and moving forward. There have been
> discussions
> > about Drill 2.0 last year and currently Drill metastore feature is under
> > active investigation which might be the breaking change for 2.0.
> >
> > Please feel free to reply to this email with your comments / concerns /
> > ideas about current project state.
> >
> > Kind regards,
> > Arina
>
>

Re: [DISCUSSION] current project state

Posted by Arina Yelchiyeva <ar...@gmail.com>.

1. Regarding the Drill metastore: it is under investigation; please follow
DRILL-6552.
2. UDFs: I would not say it is that hard to write UDFs in Drill. It could
certainly be made easier, but even in the current state we have good
manuals. Adding support for other languages such as Python would require a
full rewrite of the UDF handling code, since Drill relies heavily on Java
source code during UDF initialization. Still, it is generally a good idea
since Hive, for example, supports Scala and Python for UDFs.
3. Drill vs. Arrow is a topic I have heard about since I started working
with Drill, but so far nobody has dared to tackle it. I suspect Drill would
first have to contribute changes to Arrow to be able to migrate, which could
be a show-stopper if the Arrow community does not accept them.

Re: [DISCUSSION] current project state

Posted by Charles Givre <cg...@gmail.com>.

I’d like to weigh in here as well. As a long-time user of Drill, I would really like to see more people using it, and I think there are a few key aspects that could really help on that front.

The first is Arrow integration. I’m not enough of a software engineer to understand all the internal details here, but as I understand it, the promise of Arrow is that many tools will share a common memory model and that it will be possible to transfer data from one tool to another without having to serialize/deserialize the data. In the data science community, many of the major platforms (Python-pandas, R, and Spark) are moving to or have adopted Arrow.
Drill’s strength is the ease with which it can query many different data sources, and if Drill were to adopt Arrow, I suspect that many people would adopt it as part of a machine learning pipeline. Just recently, I attempted to do some data manipulation using Spark, and couldn’t help but notice how difficult it was in contrast with Drill. I’m sure this is a very complex task, but I do think that it could be worth it in the end.

Secondly, I’d like to second Paul’s call to simplify the interfaces for UDFs, format plugins and, ideally, storage plugins. A core strength of Drill is its extensibility, and making it easier to extend would be a great thing. I was wondering whether it would be possible, or even a good idea, to enable users to write UDFs in a scripting language such as Python.

Thirdly, I would really like to see us add more functionality to Drill. @Arina, your work to build a storage plugin for ElasticSearch is really great and I think more capabilities like that are really needed. I’d like to see a generic HTTP storage plugin and a storage plugin for Google Sheets. If I can figure out how storage plugins work, I’ll gladly work on some of these.

Just my .02.
— C





Re: [DISCUSSION] current project state

Posted by Paul Rogers <pa...@yahoo.com.INVALID>.

Hi Arina,

Another topic would be whether/how to round out Drill's data model. Drill's scalar and nullable types are pretty solid. Great work was done recently for Decimal (though the old types still remain). Good support is now available for nested types to do implicit joins to produce SQL-friendly flat records.
But opportunities for improvement still remain. Date/Time has timezone issues. Union, List and Repeated List never quite worked. There are a few types identified in the code but not implemented (dates with TZ, tiny ints, etc.). How should Drill bridge the gap between arrays and maps (really, structs) on the one hand, and plain-old-relational ODBC/JDBC/BI tools on the other?

It would be good to finalize the data types and their mapping to plain SQL: either keep a type and make it fully work if it has holes, or drop it. Unions and Lists are the messiest. They are incomplete, in part because they are trying to do the impossible: to predict the future well enough that Drill can handle columns with varying or ambiguous data types (that is, to handle schema changes). Is there a better way to handle this issue (such as with metadata hints)? That is, rather than fight with conflicting types at run time, simply declare the common type in metadata so all operators and record batches agree on the type.
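
To make the metadata-hint idea concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: DECLARED_SCHEMA and coerce_batch are hypothetical names, not Drill APIs, and plain Python lists of dicts stand in for record batches. The point is that once a common type is declared per column, every batch is converted to that type up front instead of conflicting at run time.

```python
# Hypothetical sketch: none of these names are Drill APIs.
from typing import Any, Callable, Dict, List

# Declared column types, as a metadata hint might supply them (assumed format).
DECLARED_SCHEMA: Dict[str, Callable[[Any], Any]] = {
    "id": int,
    "price": float,
    "label": str,
}


def coerce_batch(batch: List[Dict[str, Any]],
                 schema: Dict[str, Callable[[Any], Any]]) -> List[Dict[str, Any]]:
    """Convert every value to its declared type; missing or None stays None."""
    coerced = []
    for row in batch:
        out = {}
        for column, to_type in schema.items():
            value = row.get(column)
            out[column] = None if value is None else to_type(value)
        coerced.append(out)
    return coerced


# Two batches whose inferred types would conflict without a declaration:
# "price" is an int in the first batch and a string in the second.
batch_1 = [{"id": 1, "price": 10, "label": "a"}]
batch_2 = [{"id": "2", "price": "12.5", "label": None}]

result = coerce_batch(batch_1, DECLARED_SCHEMA) + coerce_batch(batch_2, DECLARED_SCHEMA)
```

With the declaration in place, both batches end up with the same column types, so a downstream operator never has to reconcile an int column with a string column.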

And, of course, there is the lingering issue of Drill vectors vs. Arrow. Arrow did great work in metadata, but seems to have kept some of the awkward aspects of Drill's original memory model (lack of control over batch sizes, ability to fragment memory). Might there be a resyncing of the two projects: Drill picks up Arrow's metadata and APIs, Arrow picks up Drill's memory improvements, such as the size-limiting "result set loader" framework?

Big-picture issues such as these tend to get lost in the 2270 open Jira tickets. How might the project create some "theme" tickets (or Wiki pages or whatever) to help pull the main issues out of the wealth of detail in Jira?

Thanks,
- Paul

On Monday, August 13, 2018, 11:07:39 AM PDT, Paul Rogers <pa...@yahoo.com> wrote:

Hi Arina,

Thanks for launching this discussion. A few minor suggestions.

The developers have done a fantastic job stabilizing and improving Drill's core functionality. Now the opportunity is to expand the use cases for Drill so that it gets wider adoption within the community. Drill competes for mindshare with Impala, Presto, Hive, Spark and others. A key differentiator for Drill can be the ability to extend the core and integrate Drill into user applications. Of these tools, only Spark has a fully extensible model. Can Drill provide some of the flexibility that has powered Spark to success?

1. You mentioned the metastore is under active investigation. Anything yet to share? Didn't see any activity on the JIRA ticket. Metadata is a key gap in Drill. Simply adding a Hive-like metastore would repeat the very errors that Drill was meant to address. Maybe we can toss around ideas for a metadata API that provides greater flexibility.

2. Users can extend the core with custom UDFs, storage engines, formats and so on. At present, the code to do this is rather hard to write, debug and maintain. Is there value in streamlining those interfaces so that a wider audience can extend Drill for their specific needs?

3. Similarly, we've seen interest in integrating Drill with other systems, which suggests an opportunity for improved APIs. Ability to associate options, defaults and restrictions with users. Ability to use the REST API for larger data sets and with stateful session options. And so on.

Such extensions are best guided by user demands: what can Drill provide for production applications to enable simpler/faster/more complete integration?

Thanks,

- Paul

On Monday, August 13, 2018, 5:42:08 AM PDT, Arina Ielchiieva <ar...@apache.org> wrote:

Hi all,

as a new PMC Chair I would like to thank users for choosing and using
Apache Drill and contributors / committers for making improvements and
fixes. Recently Apache Drill 1.14 was released bundled up with many
improvements and new features. Please feel free to try it out and share
your experience. As always we would love to hear your success stories of
using Apache Drill.

Also I encourage users to share any problems found in Drill, as well as any
suggestions for future improvements. Feel free to start discussion on the
mailing list and then file a Jira with the summary. Contributions are
always welcome: minor, major, doc improvements or grammar fixes. Just file
a Jira and open the PR. Do not hesitate to ping developers on the mailing
list if a PR is not being reviewed in a timely manner.

Latest project reports show:
The Apache Drill project has a healthy release schedule, and each release
includes lots of features.
The mailing lists (user / dev) are getting substantial support from the
active developers, as are Stackoverflow and Twitter.
New committers are added on a steady basis.

Overall, the project is growing and moving forward. There were discussions
about Drill 2.0 last year, and currently the Drill metastore feature is
under active investigation, which might be the breaking change for 2.0.

Please feel free to reply to this email with your comments / concerns /
ideas about current project state.

Kind regards,
Arina

Re: [DISCUSSION] current project state

Posted by Paul Rogers <pa...@yahoo.com.INVALID>.

Hi Arina,

Thanks for launching this discussion. A few minor suggestions.

Such extensions are best guided by user demands: what can Drill provide for production applications to enable simpler/faster/more complete integration?

Thanks,

- Paul

On Monday, August 13, 2018, 5:42:08 AM PDT, Arina Ielchiieva <ar...@apache.org> wrote:

Hi all,

Please feel free to reply to this email with your comments / concerns /
ideas about current project state.

Kind regards,
Arina
