Posted to dev@drill.apache.org by Charles Givre <cg...@gmail.com> on 2022/02/06 13:29:42 UTC

[DISCUSS] Some ideas for Drill 1.21

Hello all, 
Firstly, I wanted to thank everyone for all the work that has gone into Drill 1.20, as well as the ongoing discussion around Drill 2.0.  I wanted to start a discussion around a topic for Drill 1.21: INFO_SCHEMA improvements.  As my company wades further and further into Drill, it has become apparent that the INFO_SCHEMA could use some attention.  James Turton submitted a PR which was merged into Drill 1.20, but in doing so he uncovered an entire Pandora's box of other issues which might be worth addressing.  In a nutshell, the issues with the INFO_SCHEMA are all performance-related: it can be very slow and can consume significant resources even when executing basic queries.

My understanding of how the info schema (IS) works is that when a user executes a query, Drill will attempt to instantiate every enabled storage plugin to discover schemata and other information. As you might imagine, this can be costly. 

So, (and again, this is only meant as a conversation starter), I was thinking there are some general ideas as to how we might improve the IS:
1.  Implement a limit pushdown:  As far as I can tell, there is no limit pushdown in the IS, and this could be a relatively quick win for improving IS query performance.
2.  Caching:  I understand that caching is tricky, but perhaps we could add some sort of schema caching for IS queries, or make better use of the Drill metastore to reduce the number of connections during IS queries.  Perhaps in combination with the metastore, we could implement some sort of "metastore first" plan, whereby Drill first hits the metastore for query results and, if the limit is reached, we're done.  If not, query the storage plugins...
3.  Parallelization:  It did not appear to me that Drill parallelizes IS queries.  We may be able to add some parallelization, which would improve overall speed but not necessarily reduce overall compute cost.
4.  Convert to EVF2:  Not sure that there's a performance benefit here, but at least we could get rid of cruft.
5.  Reduce SerDe:  I imagine there was a good reason for doing this, but the IS seems to obtain a POJO from the storage plugin and then write these results to old-school Drill vectors.  I'm sure there was a reason it was done this way (or maybe not), but I have to wonder if there is a more efficient way of obtaining the information from the storage plugin, ideally without all the object creation.
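
Item 1 could look something like the following sketch, where schema enumeration short-circuits once the pushed-down limit is satisfied. All names here are illustrative stand-ins, not Drill's actual APIs:

```python
# Hypothetical sketch of a LIMIT pushdown for INFO_SCHEMA-style enumeration:
# plugins are visited lazily, and enumeration stops as soon as the pushed-down
# limit is satisfied, so later (possibly slow) plugins are never contacted.
# The dict-of-lists "catalog" stands in for real plugin metadata calls.

def tables_with_limit(plugins, limit):
    """plugins: dict of plugin name -> iterable of table names."""
    out = []
    for plugin_name, tables in plugins.items():
        for table in tables:  # in Drill this would be a metadata fetch
            if len(out) >= limit:
                return out    # limit reached: remaining plugins are skipped
            out.append(f"{plugin_name}.{table}")
    return out

print(tables_with_limit({"dfs": ["a", "b", "c"], "cp": ["x", "y"]}, 2))
# → ['dfs.a', 'dfs.b']
```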
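
Item 3 might be approached by fetching metadata from all enabled plugins concurrently, so that wall-clock time approaches the slowest single plugin rather than the sum of all of them. Again, the names here are assumptions for illustration, not Drill code:

```python
from concurrent.futures import ThreadPoolExecutor

# Hedged sketch of concurrent schema fetching across plugins. "fetchers" maps
# a plugin name to a zero-argument callable that returns that plugin's schema
# metadata; in Drill the callables would wrap the real plugin metadata calls.

def fetch_all_schemas(fetchers, max_workers=8):
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {name: pool.submit(fn) for name, fn in fetchers.items()}
        # result() blocks until done and re-raises any worker-thread exception
        return {name: f.result() for name, f in futures.items()}
```

Note this improves latency, not total compute: every plugin is still contacted, just in parallel.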

These are just some thoughts, and I'm curious as to what the community thinks about this.  Thanks everyone!
-- C

Re: [DISCUSS] Some ideas for Drill 1.21

Posted by "王腾飞(飞腾)" <fe...@cainiao.com>.
Hi Paul,

As I recall, the table is checked for existence in the queried schema when the SQL node is validated in Calcite's SqlValidatorImpl, and for some plugins a check may also be made that the column exists in the table.
------------------------------------------------------------------
From: Paul Rogers <pa...@gmail.com>
Sent: Thursday, February 10, 2022 11:06
To: dev <de...@drill.apache.org>
Subject: Re: [DISCUSS] Some ideas for Drill 1.21

Hi All,

Would be great to understand the source of the slow planning. Back in the
day, I recall colleagues trying all kinds of things to speed up planning,
but without the time to really figure out where the time went.

I wonder if the two points are related. If most of that planning time is
spent waiting for plugin metadata, then James' and Charles' issue could
possibly be the cause of the slowness that Ted saw.

James, it is still not clear what plugin metadata is being retrieved, and
when. Now, it is hard to figure that out; that code is complex. Ideally, if
you have a dozen plugins enabled, but query only one, then only that one
should be doing anything. Further, if you're using an external system (like
JDBC), the plugin should query the remote system tables only for the
table(s) you hit in your query. If the code asks ALL plugins for
information, or grabs all tables from the remote system, then, yeah, it's
going to be slow.

Adding per-plugin caching might make sense. For JDBC, say, it is not likely
that the schema of the remote DB changes between queries, so caching for
some amount of time is probably fine. And, if a query asks for an unknown
column, the plugin could refresh metadata to see if the column was just
added. (I was told that Impala users constantly had to run REFRESH METADATA
to pick up new files added to HDFS.)
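
That per-plugin caching idea could be sketched roughly as below: a time-to-live cache over a plugin's schema fetch, with a forced refresh when a query references an unknown column. Everything here (names, TTL, the shape of the schema map) is an illustrative assumption, not Drill code:

```python
import time

# Hedged sketch of per-plugin schema caching with a TTL, plus a one-shot
# forced refresh when a query references an unknown column. fetch_schema
# stands in for whatever remote metadata call a plugin would actually make.

class SchemaCache:
    def __init__(self, fetch_schema, ttl_seconds=300, clock=time.monotonic):
        self._fetch = fetch_schema   # () -> {table_name: [column_names]}
        self._ttl = ttl_seconds
        self._clock = clock
        self._schema = None
        self._fetched_at = float("-inf")

    def schema(self, force=False):
        stale = self._clock() - self._fetched_at > self._ttl
        if force or stale or self._schema is None:
            self._schema = self._fetch()
            self._fetched_at = self._clock()
        return self._schema

    def has_column(self, table, column):
        if column in self.schema().get(table, []):
            return True
        # Unknown column: maybe it was just added remotely; refresh once.
        return column in self.schema(force=True).get(table, [])
```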

For the classic, original use case (Parquet or CSV files on an HDFS-like
system), the problem was the need to scan the directory structure at plan
time to figure out which files to scan at run time. For Parquet, the
planner also wants to do Parquet row group pruning, which requires reading
the header of every one of the target files. Since this was slow, Drill
would create a quick & dirty cache, but with large numbers of files, even
reading that cache was slow (and, Drill would rebuild it any time a
directory changed, which greatly slowed planning.)

For that classic use case, saved plans never seemed a win because the
"shape" of the query heavily depended on the WHERE clause: one clause might
hit a small set of files, another hit a large set, and that then throws off
join planning, hash/broadcast exchange decisions and so on.

So, back to the suggestion to start with understanding where the time goes.
Any silly stuff we can just stop doing? Is the cost due to external
factors, such as those cited above? Or, is Calcite itself just
heavyweight? Calcite is a rules engine. Add more rules or more nodes in the DAG,
and the cost of planning rises steeply. So, are we fiddling about too much
in the planning process?

One way to test: use a mock data source and plan-time components to
eliminate all external factors. Time various query shapes using EXPLAIN.
How long does Calcite take? If a long time, then we've got a rather
difficult problem as Calcite is hard to fix/replace.

Then, time the plugins of interest. Figure out how to optimize those.

My guess is that the bottleneck won't turn out to be what we think it is.
It usually isn't.

- Paul

On Tue, Feb 8, 2022 at 8:19 AM Ted Dunning <te...@gmail.com> wrote:

> James, you make some good points.
>
> I would generally support what you say except for one special case. I think
> that there is a case to be made to be able to cache query plans in some
> fashion.
>
> The traditional approach to do this is to use "prepared queries" by which
> the application signals that it is willing to trust that a query plan will
> continue to be correct for the duration of its execution. My experience
> (and I think the industry's as well) is that the query plan is more stable
> than the underlying details of the metadata and this level of caching (or
> more) is a very good idea.
>
> In particular, the benefit to Drill is that we have a very expensive query
> planning phase (I have seen numbers in the range 200-800ms routinely) but I
> have seen execution times that are as low as a few 10's of ms. This
> imbalance severely compromises the rate of concurrent querying for fast
> queries. Having some form of plan caching would allow this planning
> overhead to drop to zero in select cases.
>
> I have been unable to even consider working on this problem, but it seems
> that one interesting heuristic would be based on two factors:
> - the ratio of execution time to planning time
> The rationale is that if a query takes much longer to run than to plan, we
> might as well do planning each time. Conversely, if a query takes much less
> time to run than it takes to plan, it is very important to avoid that
> planning time.
>
> - the degree to which recent execution times seem inconsistent with longer
> history
> The rationale here is that a persistent drop in performance for a query is
> a strong indicator that any cached plan is no longer valid and should be
> updated. Conversely, if recent query history is consistent with long-term
> history, that is a vote of confidence for the plan. Furthermore, depending
> on how this is implemented the chance of a false positive change detection
> may increase with very long histories which gives a desirable side effect
> of occasional replanning.
>
> What do other people think about this?
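
The two factors above could be combined into a rough decision rule like this sketch (all names and thresholds are illustrative assumptions, not a worked-out design):

```python
# Hedged sketch of the two-factor plan-caching heuristic: reuse a cached plan
# only when (a) planning cost dominates execution cost, and (b) recent run
# times have not drifted away from long-term history (drift suggests the
# cached plan has gone stale and should be rebuilt).

def should_use_cached_plan(plan_ms, recent_run_ms, history_run_ms,
                           ratio_threshold=2.0, drift_factor=3.0):
    if not recent_run_ms or not history_run_ms:
        return False                      # no history yet: plan normally
    recent_avg = sum(recent_run_ms) / len(recent_run_ms)
    history_avg = sum(history_run_ms) / len(history_run_ms)
    # Factor 1: only worth caching when planning dominates execution.
    if plan_ms < ratio_threshold * history_avg:
        return False
    # Factor 2: a persistent slowdown suggests the plan is stale; replan.
    if recent_avg > drift_factor * history_avg:
        return False
    return True
```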
>
>
>
>
> On Tue, Feb 8, 2022 at 4:48 AM James Turton <dz...@apache.org> wrote:
>
> > My 2c while I wait for 1.20.0 RC1 to upload.
> >
> > I think it's good that we continue to bring every design decision out
> > here to the community like Charles did with this one.  Some relevant
> > excerpts I turned up while zipping around a few ASF docs now.
> >
> >     "Mailing lists are the virtual rooms where ASF communities live,
> >     form and grow. All formal decisions the project's PMC makes need to
> >     have an email thread (possibly with a recorded vote) as an audit
> >     trail that this was an official decision." [1]
> >
> >     "We firmly believe in hats
> >     <https://www.apache.org/foundation/how-it-works.html#hats>. Your
> >     role at the ASF is one assigned to you personally, and is bestowed
> >     on you by your peers. It is not tied to your job or current employer
> >     or company." [2]
> >
> >     "Unless they specifically state otherwise, whatever an ASF
> >     participant posts on any mailing list is done /as themselves/. It is
> >     the individual point-of-view, wearing their personal hat and not as
> >     a mouthpiece for whatever company happens to be signing their
> >     paychecks right now, and not even as a director of the ASF." [2]
> >
> >
> > Info schema.  Info schema is slow when the set of enabled storage
> > plugins is slow to register schemas.  Flaky plugins can be so slow to do
> > this as to make info schema appear broken.  Info schema recently had its
> > filter push down improved so that unneeded schema registration is
> > avoidable [3], and I tested it working in the case of an unreachable
> > active PostgreSQL plugin (provided my WHERE clause excluded said pg).
> >
> > In my opinion making today's "on-demand" info schema, which re-fetches
> > schema metadata from sources whenever a query requests it, more
> > efficient is the right place to start.  Rewriting it on EVF2 would, I
> > understand, gain it limit push down support for free, though filter push
> > down seems more likely to be helpful on this kind of data to me.  There
> > is also no reason I can see for info schema not to fetch schema metadata
> > from plugins concurrently.  I don't know if this would be best achieved
> > by explicit programming of the concurrency, or by making the info schema
> > look "splittable" to Drill so that multiple fragments get created.
> >
> > Lastly, I'm generally against introducing any sort of results caching,
> > data or metadata, except in special circumstances such as when the
> > planner can be certain that the underlying data has not changed (seldom
> > or never the case for Drill because it doesn't control its own storage
> > layer).  I think that databases, reliable ones anyway, tend to shun
> > results caching and push it to the application layer, since only that
> > layer can decide what kind of staleness is acceptable, but correct me if
> > I'm wrong.  My conclusion here is that I'd rather do this last, and only
> > after careful consideration.
> >
> > [1] https://infra.apache.org/mailing-list-moderation.html
> > [2] https://www.apache.org/foundation/how-it-works.html#management
> > [3] https://github.com/apache/drill/pull/2388
> >
> > On 2022/02/07 21:05, Ted Dunning wrote:
> > > Another option is to store metadata as data in a distributed data
> store.
> > > For static resources, that can scale very well. For highly dynamic
> > > resources like conventional databases behind JDBC connections, you can
> > > generally delegate metadata to that layer. Performance for delegated
> > > metadata won't necessarily be great, but those systems are usually
> either
> > > small (like Postgres or MySQL) or fading away (like Hive).
> > >
> > > Focusing metadata and planning to a single node will make query
> > concurrency
> > > much worse (and it's already not good).
> > >
> > >
> > > On Sun, Feb 6, 2022 at 6:28 PM Paul Rogers<pa...@gmail.com>  wrote:
> > >
> > >> Hi All,
> > >>
> > >> Drill, like all open source projects, exists to serve those that use
> > it. To
> > >> that end, the best contributions come when some company needs a
> feature
> > >> badly enough that it is worth the effort to develop and contribute a
> > >> solution. That's pretty standard, as long as the contribution is
> > general
> > >> purpose. In fact, I hope everyone using Drill in support of their
> > company
> > >> will contribute enhancements back to Drill. If you maintain your own
> > >> private fork, you're not helping the community that provided you with
> > the
> > >> bulk of the code.
> > >>
> > >> For the info schema, I'm at a loss to guess why this would be slow,
> > unless
> > >> every plugin is going off and scanning some external source. Knowing
> > that
> > >> we have a dozen plugins is not slow. Looking at plugin configs is not
> > slow.
> > >> What could be slow is if you want to know about every possible file in
> > HDFS
> > >> or S3, every database and table in an external DB, etc. In this case,
> > the
> > >> bottleneck is either the external system, or the act of querying a
> dozen
> > >> different external systems. Perhaps, Charles, you can elaborate on the
> > >> specific scenario you have in mind.
> > >>
> > >> Depending on the core issues, there are various solutions. One
> solution
> > is
> > >> to cache all the external metadata in Drill. That's what Impala did
> with
> > >> the Hive Metastore, and it was a mess. I don't expect Drill would do
> any
> > >> better a job. One reason it was a mess is that, in a production
> system,
> > >> there is a vast amount of metadata. You end up playing all manner of
> > tricks
> > >> to try to compress it. Since Drill (and Impala) are fully symmetric,
> > each
> > >> node has to hold the entire cache. That is memory that can't be used
> to
> > run
> > >> queries. So, to gain performance (for metadata) you give up
> performance
> > (at
> > >> run time.)
> > >>
> > >> One solution is to create a separate metadata cache node. The query
> > goes to
> > >> some Drillbit that acts as Foreman. The Foreman plans the query and
> > >> retrieves the needed metadata from the metadata node. The challenge
> > here is
> > >> that there will be a large amount of metadata transferred, and the
> next
> > >> thing we know we'll want to cache it in each Drillbit, putting us back
> > >> where we started.
> > >>
> > >> So, one can go another step: shift all query planning to the metadata
> > node
> > >> and have a single planner node. The user connects to any Drillbit as
> > >> Foreman, but that Foreman first talks to the "planner/metadata" node
> to
> > >> give it SQL and get back a plan. The Foreman then runs the plan as
> > usual.
> > >> (The Foreman runs the root fragment of the plan, which can be compute
> > >> intensive, so we don't want the planner node to also act as the
> > Foreman.)
> > >> The notion here is that the SQL in/plan out is much smaller than the
> > >> metadata that is needed to compute the plan.
> > >>
> > >> The idea about metadata has long been that Drill should provide a
> > metadata
> > >> API. The Drill metastore should be seen as just one of many metadata
> > >> implementations. The Drill metastore is a "starter solution" for those
> > who
> > >> have not already invested in another solution. (Many shops have HMS or
> > >> Amazon Glue, which is Amazon's version of HMS, or one of the newer
> > >> metadata/catalog solutions.)
> > >>
> > >> One can go even further. Consider file and directory pruning in HMS.
> > Every
> > >> tool has to do the exact same thing: given a set of predicates, find
> the
> > >> directories and files that match. Impala does it. Spark must do it.
> > >> Presto/Trino probably does it. Drill, when operating in Hive/HMS mode
> > must
> > >> do it. Maybe someone has come up with the One True Metadata Pruner and
> > Drill
> > >> can just delegate the task to that external tool, and get back the
> list
> > of
> > >> directories and files to scan. Far better than building yet another
> > pruner.
> > >> (I think Drill currently has two Parquet metadata pruners, duplicating
> > what
> > >> many other tools have done.)
> > >>
> > >> If we see the source of metadata as plugable, then a shop such as DDR
> > that
> > >> has specific needs (maybe caching those external schemas), can build a
> > >> metadata plugin for that use case. If the solution is general, it can
> be
> > >> contributed to Drill as another metadata option.
> > >>
> > >> In any case, if we can better understand the specific problem you are
> > >> encountering, we can perhaps offer more specific suggestions.
> > >>
> > >> Thanks,
> > >>
> > >> - Paul
> > >>
> > >> On Sun, Feb 6, 2022 at 8:11 AM Charles Givre<cg...@gmail.com>
> wrote:
> > >>
> > >>> Hi Luoc,
> > >>> Thanks for your concern.  Apache projects are often backed
> unofficially
> > >> by
> > >>> a company.  Drill was, for years, backed by MapR, as evidenced by all
> the
> > >> MapR
> > >>> unique code that is still in the Drill codebase. However, since
> MapR's
> > >>> acquisition, I think it is safe to say that Drill really has become a
> > >>> community-driven project.  While some of the committers are
> colleagues
> > of
> > >>> mine at DataDistillr, and Drill is a core part of DataDistillr, from
> > our
> > >>> perspective, we've really just been focusing on making Drill better
> for
> > >>> everyone as well as building the community of Drill users, regardless
> > of
> > >>> whether they use DataDistillr or not.  We haven't rejected any PRs
> > >> because
> > >>> they go against our business model or tried to steer Drill against
> the
> > >>> community or anything like that.
> > >>>
> > >>> Just for your awareness, there are other OSS projects, including some
> > >>> Apache projects where one company controls everything.  Outside
> > >>> contributions are only accepted if they fit the company's roadmap,
> and
> > >>> there is no real community-building that happens.  From my
> perspective,
> > >>> that is not what I want from Drill.  My personal goal is to build an
> > >> active
> > >>> community of users and developers around an awesome tool.
> > >>>
> > >>> I hope this answers your concerns.
> > >>> Best,
> > >>> -- C
> > >>>
> > >>>
> > >>>> On Feb 6, 2022, at 9:42 AM, luoc<lu...@apache.org>  wrote:
> > >>>>
> > >>>>
> > >>>> Before we discuss the next release, I would like to explain that
> > Apache
> > >>> project should not be directly linked to a commercial company,
> > otherwise
> > >>> this will affect the motivation of the community to contribute.
> > >>>> Thanks.
> > >>>>
> > >>>>> On Feb 6, 2022, at 21:29, Charles Givre<cg...@gmail.com>  wrote:
> > >>>>>
> > >>>>>  Hello all,
> > >>>>> Firstly, I wanted to thank everyone for all the work that has gone
> > >> into
> > >>> Drill 1.20 as well as the ongoing discussion around Drill 2.0.   I
> > wanted
> > >>> to start a discussion around topic for Drill 1.21 and that is
> > INFO_SCHEMA
> > >>> improvements.  As my company wades further and further into Drill, it
> > has
> > >>> become apparent that the INFO_SCHEMA could use some attention.  James
> > >>> Turton submitted a PR which was merged into Drill 1.20, but in so
> doing
> > >> he
> > >>> uncovered an entire Pandora's box of other issues which might be
> worth
> > >>> addressing.  In a nutshell, the issues with the INFO_SCHEMA are all
> > >>> performance related: it can be very slow and also can consume
> > significant
> > >>> resources when executing even basic queries.
> > >>>>> My understanding of how the info schema (IS) works is that when a
> > user
> > >>> executes a query, Drill will attempt to instantiate every enabled
> > storage
> > >>> plugin to discover schemata and other information. As you might
> > imagine,
> > >>> this can be costly.
> > >>>>> So, (and again, this is only meant as a conversation starter), I
> was
> > >>> thinking there are some general ideas as to how we might improve the
> > IS:
> > >>>>> 1.  Implement a limit pushdown:  As far as I can tell, there is no
> > >>> limit pushdown in the IS and this could be a relatively quick win for
> > >>> improving IS query performance.
> > >>>>> 2.  Caching:  I understand that caching is tricky, but perhaps we
> > >> could
> > >>> add some sort of schema caching for IS queries, or make better use of
> > the
> > >>> Drill metastore to reduce the number of connections during IS
> queries.
> > >>> Perhaps in combination with the metastore, we could implement some
> sort
> > >> of
> > >>> "metastore first" plan, whereby Drill first hits the metastore for
> > query
> > >>> results and if the limit is reached, we're done.  If not, query the
> > >> storage
> > >>> plugins...
> > >>>>> 3.  Parallelization:  It did not appear to me that Drill
> parallelizes
> > >>> IS queries.   We may be able to add some parallelization which would
> > >>> improve overall speed, but not necessarily reduce overall compute
> cost
> > >>>>> 4.  Convert to EVF2:  Not sure that there's a performance benefit
> > >> here,
> > >>> but at least we could get rid of cruft
> > >>>>> 5.  Reduce SerDe:   I imagine there was a good reason for doing
> this,
> > >>> but the IS seems to obtain a POJO from the storage plugin then write
> > >> these
> > >>> results to old-school Drill vectors.  I'm sure there was a reason it
> > was
> > >>> done this way, (or maybe not) but I have to wonder if there is a more
> > >>> efficient way of obtaining the information from the storage plugin,
> > >> ideally
> > >>> w/o all the object creation.
> > >>>>> These are just some thoughts, and I'm curious as to what the
> > community
> > >>> thinks about this.  Thanks everyone!
> > >>>>> -- C
> > >>>
> >
>

回复:[DISCUSS] Some ideas for Drill 1.21

Posted by "王腾飞(飞腾)" <fe...@cainiao.com>.
Hi Paul,

As I recall, the table will be checked if exist in the schema queried when validate sql node in SqlValidatorImpl, and for some plugins it may also check if column exists in the table.
------------------------------------------------------------------
发件人:Paul Rogers <pa...@gmail.com>
发送时间:2022年2月10日(星期四) 11:06
收件人:dev <de...@drill.apache.org>
主 题:Re: [DISCUSS] Some ideas for Drill 1.21

Hi All,

Would be great to understand the source of the slow planning. Back in the
day, I recall colleagues trying all kinds of things to speed up planning,
but without the time to really figure out where the time went.

I wonder if the two points are related. If most of that planning time is
spent waiting for a plugin metadata, then James' & Charles' issue could
possibly be the cause of the slowness that Ted saw.

James, it is still not clear what plugin metadata is being retrieved, and
when. Now, it is hard to figure that out; that code is complex. Ideally, if
you have a dozen plugins enabled, but query only one, then only that one
should be doing anything. Further, if you're using an external system (like
JDBC), the plugin should query the remote system tables only for the
table(s) you hit in your query. If the code asks ALL plugins for
information, or grabs all tables from the remote system, they, yeah, it's
going to be slow.

Adding per-plugin caching might make sense. For JDBC, say, it is not likely
that the schema of the remote DB changes between queries, so caching for
some amount of time is probably fine. And, if a query asks for an unknown
column, the plugin could refresh metadata to see if the column was just
added. (I was told that Impala users constantly had to run REFRESH METADATA
to pick up new files added to HDFS.)

For the classic, original use case (Parquet or CSV files on an HDFS-like
system), the problem was the need to scan the directory structure at plan
time to figure out which files to scan at run time. For Parquet, the
planner also wants to do Parquet row group pruning, which requires reading
the header of every one of the target files. Since this was slow, Drill
would create a quick & dirty cache, but with large numbers of files, even
reading that cache was slow (and, Drill would rebuild it any time a
directory changed, which greatly slowed planning.)

For that classic use case, saved plans never seemed a win because the
"shape" of the query heavily depended on the WHERE clause: one clause might
hit a small set of files, another hit a large set, and that then throws off
join planning, hash/broadcast exchange decisions and so on.

So, back to the suggestion to start with understanding where the time goes.
Any silly stuff we can just stop doing? Is the cost due to external
factors, such as those cited above? Or, is Calcite itself just heavy
weight? Calcite is a rules engine. Add more rules or more nodes in the DAG,
and the cost of planning rises steeply. So, are we fiddling about too much
in the planning process?

One way to test: use a mock data source and plan-time components to
eliminate all external factors. Time various query shapes using EXPLAIN.
How long does Calcite take? If a long time, then we've got a rather
difficult problem as Calcite is hard to fix/replace.

Then, time the plugins of interest. Figure out how to optimize those.

My guess is that the bottleneck won't turn out to be what we think it is.
It usually isn't.

- Paul

On Tue, Feb 8, 2022 at 8:19 AM Ted Dunning <te...@gmail.com> wrote:

> James, you make some good points.
>
> I would generally support what you say except for one special case. I think
> that there is a case to be made to be able to cache query plans in some
> fashion.
>
> The traditional approach to do this is to use "prepared queries" by which
> the application signals that it is willing to trust that a query plan will
> continue to be correct for the duration of its execution. My experience
> (and I think the industry's as well) is that the query plan is more stable
> than the underlying details of the metadata and this level of caching (or
> more) is a very good idea.
>
> In particular, the benefit to Drill is that we have a very expensive query
> planning phase (I have seen numbers in the range 200-800ms routinely) but I
> have seen execution times that are as low as a few 10's of ms. This
> imbalance severely compromises the rate of concurrent querying for fast
> queries. Having some form of plan caching would allow this planning
> overhead to drop to zero in select cases.
>
> I have been unable to even consider working on this problem, but it seems
> that one interesting heuristic would be based on two factors:
> - the ratio of execution time to planning time
> The rationale is that if a query takes much longer to run than to plan, we
> might as well do planning each time. Conversely, if a query takes much less
> time to run than it takes to plan, it is very important to avoid that
> planning time.
>
> - the degree to which recent execution times seem inconsistent with longer
> history
> The rationale here is that a persistent drop in performance for a query is
> a strong indicator that any cached plan is no longer valid and should be
> updated. Conversely, if recent query history is consistent with long-term
> history, that is a vote of confidence for the plan. Furthermore, depending
> on how this is implemented the chance of a false positive change detection
> may increase with very long histories which gives a desirable side effect
> of occasional replanning.
>
> What do other people think about this?
>
>
>
>
> On Tue, Feb 8, 2022 at 4:48 AM James Turton <dz...@apache.org> wrote:
>
> > My 2c while I wait for 1.20.0 RC1 to upload.
> >
> > I think it's good that we continue to bring every design decision out
> > here to the community like Charles did with this one.  Some relevant
> > excerpts I turned up while zipping around a few ASF docs now.
> >
> >     "Mailing lists are the virtual rooms where ASF communities live,
> >     form and grow. All formal decisions the project's PMC makes need to
> >     have an email thread (possibly with a recorded vote) as an audit
> >     trail that this was an official decision." [1]
> >
> >     "We firmly believe in hats
> >     <https://www.apache.org/foundation/how-it-works.html#hats>. Your
> >     role at the ASF is one assigned to you personally, and is bestowed
> >     on you by your peers. It is not tied to your job or current employer
> >     or company." [2]
> >
> >     "Unless they specifically state otherwise, whatever an ASF
> >     participant posts on any mailing list is done /as themselves/. It is
> >     the individual point-of-view, wearing their personal hat and not as
> >     a mouthpiece for whatever company happens to be signing their
> >     paychecks right now, and not even as a director of the ASF." [2]
> >
> >
> > Info schema.  Info schema is slow when the set of enabled storage
> > plugins is slow to register schemas.  Flaky plugins can be so slow to do
> > this as to make info schema appear broken.  Info schema recently had its
> > filter push down improved so that unneeded schema registration is
> > avoidable [3], and I tested it working in the case of an unreachable
> > active PostgreSQL plugin (provided my WHERE clause excluded said pg).
> >
> > In my opinion making today's "on-demand" info schema, which re-fetches
> > schema metadata from sources whenever a query requests it, more
> > efficient is the right place to start.  Rewriting it on EVF2 would, I
> > understand, gain it limit push down support for free, though filter push
> > down seems more likely to be helpful on this kind of data to me.  There
> > is also no reason I can see for info schema not to fetch schema metadata
> > from plugins concurrently.  I don't know if this would be best achieved
> > by explicit programming of the concurrency, or by making the info schema
> > look "splittable" to Drill so that multiple fragments get created.
> >
> > Lastly, I'm generally against introducing any sort of results caching,
> > data or metadata, except in special circumstances such as when the
> > planner can be certain that the underlying data has not changed (seldom
> > or never the case for Drill because it doesn't control its own storage
> > layer).  I think that databases, reliable ones anyway, tend to shun
> > results caching and push it to the application layer, since only that
> > layer can decide what kind of staleness is acceptable, but correct me if
> > I'm wrong.  My conclusion here is that I'd rather do this last, and only
> > after careful consideration.
> >
> > [1] https://infra.apache.org/mailing-list-moderation.html
> > [2] https://www.apache.org/foundation/how-it-works.html#management
> > [3] https://github.com/apache/drill/pull/2388
> >
> > On 2022/02/07 21:05, Ted Dunning wrote:
> > > Another option is to store metadata as data in a distributed data
> store.
> > > For static resources, that can scale very well. For highly dynamic
> > > resources like conventional databases behind JDBC connections, you can
> > > generally delegate metadata to that layer. Performance for delegated
> > > metadata won't necessarily be great, but those systems are usually
> either
> > > small (like Postgress or mySQL) or fading away (like Hive).
> > >
> > > Focusing metadata and planning to a single node will make query
> > concurrency
> > > much worse (and it's already not good).
> > >
> > >
> > > On Sun, Feb 6, 2022 at 6:28 PM Paul Rogers<pa...@gmail.com>  wrote:
> > >
> > >> Hi All,
> > >>
> > >> Drill, like all open source projects, exists to serve those that use
> > it. To
> > >> that end, the best contributions come when some company needs a
> feature
> > >> badly enough that it is worth the effort to develop and contribute a
> > >> solution. That's pretty standard, as long as the contribution is
> > general
> > >> purpose. In fact, I hope everyone using Drill in support of their
> > company
> > >> will contribute enhancements back to Drill. If you maintain your own
> > >> private fork, you're not helping the community that provided you with
> > the
> > >> bulk of the code.
> > >>
> > >> For the info schema, I'm at a loss to guess why this would be slow,
> > unless
> > >> every plugin is going off and scanning some external source. Knowing
> > that
> > >> we have a dozen plugins is not slow. Looking at plugin configs is not
> > slow.
> > >> What could be slow is if you want to know about every possible file in
> > HDFS
> > >> or S3, every database and table in an external DB, etc. In this case,
> > the
> > >> bottleneck is either the external system, or the act of querying a
> dozen
> > >> different external systems. Perhaps, Charles, you can elaborate on the
> > >> specific scenario you have in mind.
> > >>
> > >> Depending on the core issues, there are various solutions. One
> solution
> > is
> > >> to cache all the external metadata in Drill. That's what Impala did
> with
> > >> the Hive Metastore, and it was a mess. I don't expect Drill would do
> any
> > >> better a job. One reason it was a mess is that, in a production
> system,
> > >> there is a vast amount of metadata. You end up playing all manner of
> > tricks
> > >> to try to compress it. Since Drill (and Impala) are fully symmetric,
> > each
> > >> node has to hold the entire cache. That is memory that can't be used
> to
> > run
> > >> queries. So, to gain performance (for metadata) you give up
> performance
> > (at
> > >> run time.)
> > >>
> > >> One solution is to create a separate metadata cache node. The query
> > goes to
> > >> some Drillbit that acts as Foreman. The Foreman plans the query and
> > >> retrieves the needed metadata from the metadata node. The challenge
> > here is
> > >> that there will be a large amount of metadata transferred, and the
> next
> > >> thing we know we'll want to cache it in each Drillbit, putting us back
> > >> where we started.
> > >>
> > >> So, one can go another step: shift all query planning to the metadata
> > node
> > >> and have a single planner node. The user connects to any Drillbit as
> > >> Foreman, but that Foreman first talks to the "planner/metadata" node
> to
> > >> give it SQL and get back a plan. The Foreman then runs the plan as
> > usual.
> > >> (The Foreman runs the root fragment of the plan, which can be compute
> > >> intensive, so we don't want the planner node to also act as the
> > Foreman.)
> > >> The notion here is that the SQL in/plan out is much smaller than the
> > >> metadata that is needed to compute the plan.
> > >>
> > >> The idea about metadata has long been that Drill should provide a
> > metadata
> > >> API. The Drill metastore should be seen as just one of many metadata
> > >> implementations. The Drill metastore is a "starter solution" for those
> > who
> > >> have not already invested in another solution. (Many shops have HMS or
> > >> Amazon Glue, which is Amazon's version of HMS, or one of the newer
> > >> metadata/catalog solutions.)
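The pluggable metadata API idea might reduce to an interface like the sketch below. The names and shapes here are invented for illustration, not Drill's actual types; the point is that the Drill metastore, HMS, Glue, or any newer catalog could sit behind the same contract:

```python
from abc import ABC, abstractmethod

class MetadataProvider(ABC):
    """Hypothetical pluggable metadata API."""

    @abstractmethod
    def schemas(self):
        """Return the known schema names."""

    @abstractmethod
    def tables(self, schema):
        """Return the table names within one schema."""

class InMemoryMetastore(MetadataProvider):
    """A trivial 'starter' implementation standing in for the Drill
    metastore; real providers would delegate to an external catalog."""

    def __init__(self, catalog):
        self._catalog = catalog  # {schema_name: [table_name, ...]}

    def schemas(self):
        return sorted(self._catalog)

    def tables(self, schema):
        return sorted(self._catalog.get(schema, []))

provider = InMemoryMetastore({"dfs.tmp": ["logs"], "pg.public": ["users"]})
```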
> > >>
> > >> One can go even further. Consider file and directory pruning in HMS.
> > Every
> > >> tool has to do the exact same thing: given a set of predicates, find
> the
> > >> directories and files that match. Impala does it. Spark must do it.
> > >> Presto/Trino probably does it. Drill, when operating in Hive/HMS mode
> > must
> > >> do it. Maybe someone has come up with the One True Metadata Pruner and
> > Drill
> > >> can just delegate the task to that external tool, and get back the
> list
> > of
> > >> directories and files to scan. Far better than building yet another
> > pruner.
> > >> (I think Drill currently has two Parquet metadata pruners, duplicating
> > what
> > >> many other tools have done.)
> > >>
> > >> If we see the source of metadata as pluggable, then a shop such as DDR
> > that
> > >> has specific needs (maybe caching those external schemas), can build a
> > >> metadata plugin for that use case. If the solution is general, it can
> be
> > >> contributed to Drill as another metadata option.
> > >>
> > >> In any case, if we can better understand the specific problem you are
> > >> encountering, we can perhaps offer more specific suggestions.
> > >>
> > >> Thanks,
> > >>
> > >> - Paul
> > >>
> > >> On Sun, Feb 6, 2022 at 8:11 AM Charles Givre<cg...@gmail.com>
> wrote:
> > >>
> > >>> Hi Luoc,
> > >>> Thanks for your concern.  Apache projects are often backed
> unofficially
> > >> by
> > >>> a company.  Drill was, for years, backed by MapR, as evidenced by all
> the
> > >> MapR
> > >>> unique code that is still in the Drill codebase. However, since
> MapR's
> > >>> acquisition, I think it is safe to say that Drill really has become a
> > >>> community-driven project.  While some of the committers are
> colleagues
> > of
> > >>> mine at DataDistillr, and Drill is a core part of DataDistillr, from
> > our
> > >>> perspective, we've really just been focusing on making Drill better
> for
> > >>> everyone as well as building the community of Drill users, regardless
> > of
> > >>> whether they use DataDistillr or not.  We haven't rejected any PRs
> > >> because
> > >>> they go against our business model or tried to steer Drill against
> the
> > >>> community or anything like that.
> > >>>
> > >>> Just for your awareness, there are other OSS projects, including some
> > >>> Apache projects where one company controls everything.  Outside
> > >>> contributions are only accepted if they fit the company's roadmap,
> and
> > >>> there is no real community-building that happens.  From my
> perspective,
> > >>> that is not what I want from Drill.  My personal goal is to build an
> > >> active
> > >>> community of users and developers around an awesome tool.
> > >>>
> > >>> I hope this answers your concerns.
> > >>> Best,
> > >>> -- C
> > >>>
> > >>>
> > >>>> On Feb 6, 2022, at 9:42 AM, luoc<lu...@apache.org>  wrote:
> > >>>>
> > >>>>
> > >>>> Before we discuss the next release, I would like to explain that
> > Apache
> > >>> project should not be directly linked to a commercial company,
> > otherwise
> > >>> this will affect the motivation of the community to contribute.
> > >>>> Thanks.
> > >>>>
> > >>>>> On Feb 6, 2022, at 21:29, Charles Givre<cg...@gmail.com>  wrote:
> > >>>>>
> > >>>>>  Hello all,
> > >>>>> Firstly, I wanted to thank everyone for all the work that has gone
> > >> into
> > >>> Drill 1.20 as well as the ongoing discussion around Drill 2.0.   I
> > wanted
> > >>> to start a discussion around a topic for Drill 1.21, and that is
> > INFO_SCHEMA
> > >>> improvements.  As my company wades further and further into Drill, it
> > has
> > >>> become apparent that the INFO_SCHEMA could use some attention.  James
> > >>> Turton submitted a PR which was merged into Drill 1.20, but in so
> doing
> > >> he
> > >>> uncovered an entire Pandora's box of other issues which might be
> worth
> > >>> addressing.  In a nutshell, the issues with the INFO_SCHEMA are all
> > >>> performance related: it can be very slow and also can consume
> > significant
> > >>> resources when executing even basic queries.
> > >>>>> My understanding of how the info schema (IS) works is that when a
> > user
> > >>> executes a query, Drill will attempt to instantiate every enabled
> > storage
> > >>> plugin to discover schemata and other information. As you might
> > imagine,
> > >>> this can be costly.
> > >>>>> So, (and again, this is only meant as a conversation starter), I
> was
> > >>> thinking there are some general ideas as to how we might improve the
> > IS:
> > >>>>> 1.  Implement a limit pushdown:  As far as I can tell, there is no
> > >>> limit pushdown in the IS and this could be a relatively quick win for
> > >>> improving IS query performance.
> > >>>>> 2.  Caching:  I understand that caching is tricky, but perhaps we
> > >> could
> > >>> add some sort of schema caching for IS queries, or make better use of
> > the
> > >>> Drill metastore to reduce the number of connections during IS
> queries.
> > >>> Perhaps in combination with the metastore, we could implement some
> sort
> > >> of
> > >>> "metastore first" plan, whereby Drill first hits the metastore for
> > query
> > >>> results and if the limit is reached, we're done.  If not, query the
> > >> storage
> > >>> plugins...
> > >>>>> 3.  Parallelization:  It did not appear to me that Drill
> parallelizes
> > >>> IS queries.   We may be able to add some parallelization which would
> > >>> improve overall speed, but not necessarily reduce overall compute
> cost
> > >>>>> 4.  Convert to EVF2:  Not sure that there's a performance benefit
> > >> here,
> > >>> but at least we could get rid of cruft
> > >>>>> 5.  Reduce SerDe:   I imagine there was a good reason for doing
> this,
> > >>> but the IS seems to obtain a POJO from the storage plugin then write
> > >> these
> > >>> results to old-school Drill vectors.  I'm sure there was a reason it
> > was
> > >>> done this way, (or maybe not) but I have to wonder if there is a more
> > >>> efficient way of obtaining the information from the storage plugin,
> > >> ideally
> > >>> w/o all the object creation.
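The limit-pushdown idea in point 1 amounts to making the info schema scan lazy, so that plugins past the limit are never contacted at all. A hedged sketch, with invented plugin/table structures standing in for real metadata calls:

```python
def info_schema_rows(plugins):
    """Lazily yield INFORMATION_SCHEMA-style rows, one plugin at a time.
    Because this is a generator, plugins after the limit is reached are
    never asked for their (possibly expensive) metadata."""
    for plugin in plugins:
        for table in plugin["tables"]:   # stand-in for a real metadata fetch
            yield {"schema": plugin["name"], "table": table}

def query_with_limit(plugins, limit):
    rows = []
    for row in info_schema_rows(plugins):
        rows.append(row)
        if len(rows) >= limit:           # the limit, pushed into the scan
            break
    return rows

plugins = [
    {"name": "dfs", "tables": ["a", "b"]},
    {"name": "pg", "tables": ["c"]},     # never touched when limit <= 2
]
```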
> > >>>>> These are just some thoughts, and I'm curious as to what the
> > community
> > >>> thinks about this.  Thanks everyone!
> > >>>>> -- C
> > >>>
> >
>

Re: [DISCUSS] Some ideas for Drill 1.21

Posted by "王腾飞(飞腾)" <fe...@cainiao.com>.
Maybe we can have all three: a schema/table metadata cache, a query plan cache, and a query result cache, each with an option to switch it on or off. They solve problems at different levels, and the one with the bigger benefit tends to have the more limited scope. Letting the user choose whether to enable them based on their situation may be more flexible. In our production cluster, we already have a metadata cache and a query result cache, and we are planning to add a query plan cache to benefit repeated runs of the same SQL with different parameters.
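The three independently switchable cache layers described above could be pictured like this (names and structure invented for illustration; `plan_fn`/`execute_fn` stand in for the planner and executor):

```python
class CachingLayer:
    """Sketch of switchable cache layers, checked cheapest-first:
    result cache, then plan cache, then full planning + execution."""

    def __init__(self, plan_fn, execute_fn,
                 use_result_cache=True, use_plan_cache=True):
        self.plan_fn, self.execute_fn = plan_fn, execute_fn
        self.use_result_cache = use_result_cache
        self.use_plan_cache = use_plan_cache
        self._results, self._plans = {}, {}

    def run(self, sql):
        if self.use_result_cache and sql in self._results:
            return self._results[sql]        # no planning, no execution
        plan = self._plans.get(sql) if self.use_plan_cache else None
        if plan is None:
            plan = self.plan_fn(sql)
            if self.use_plan_cache:
                self._plans[sql] = plan      # same SQL skips planning next time
        result = self.execute_fn(plan)
        if self.use_result_cache:
            self._results[sql] = result
        return result
```

A real implementation would also need invalidation and staleness controls per James's objection above; this only shows the lookup order and the on/off switches.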


------------------------------------------------------------------
From: Ted Dunning <te...@gmail.com>
Date: Thursday, February 10, 2022, 13:14
To: dev <de...@drill.apache.org>
Subject: Re: [DISCUSS] Some ideas for Drill 1.21

The planning time has been extensively analyzed.

It is inherent in a Volcano-style cost-based optimizer. This is a
branch-and-bound search of an exponential design space.

This bottleneck is very well understood.

Further, it has been accelerated under specialized conditions. As part of
OJAI, there was a limited form of Drill that was included that could work
on specific kinds of tables built into MapR FS. With some rather severe
truncations of the space that the optimizer had to search, the planning
time could be reduced to tens of milliseconds. That was fine for a limited
mission, but some of the really dramatic benefits of Drill on large
queries across complex domains would be impossible with that truncated rule
set.
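For a sense of why that search space is exponential: the number of distinct binary join trees over n base relations (counting operand order) is (2n-2)!/(n-1)!, a standard combinatorics result rather than anything from this thread. A quick calculation makes the growth concrete:

```python
from math import factorial

def join_orders(n):
    """Number of distinct binary join trees over n labeled relations,
    counting left/right operand order: (2n-2)! / (n-1)!.  This is the
    space a cost-based optimizer must, in principle, search, which is
    why pruning or truncating the rule set changes planning time so much."""
    return factorial(2 * n - 2) // factorial(n - 1)
```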



On Wed, Feb 9, 2022 at 7:06 PM Paul Rogers <pa...@gmail.com> wrote:

> Hi All,
>
> Would be great to understand the source of the slow planning. Back in the
> day, I recall colleagues trying all kinds of things to speed up planning,
> but without the time to really figure out where the time went.
>
> I wonder if the two points are related. If most of that planning time is
> spent waiting for plugin metadata, then James' & Charles' issue could
> possibly be the cause of the slowness that Ted saw.
>
> James, it is still not clear what plugin metadata is being retrieved, and
> when. Now, it is hard to figure that out; that code is complex. Ideally, if
> you have a dozen plugins enabled, but query only one, then only that one
> should be doing anything. Further, if you're using an external system (like
> JDBC), the plugin should query the remote system tables only for the
> table(s) you hit in your query. If the code asks ALL plugins for
> information, or grabs all tables from the remote system, then, yeah, it's
> going to be slow.
>
> Adding per-plugin caching might make sense. For JDBC, say, it is not likely
> that the schema of the remote DB changes between queries, so caching for
> some amount of time is probably fine. And, if a query asks for an unknown
> column, the plugin could refresh metadata to see if the column was just
> added. (I was told that Impala users constantly had to run REFRESH METADATA
> to pick up new files added to HDFS.)
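The per-plugin TTL cache with a refresh on unknown columns, as suggested above, could be sketched like this (the class and `fetch_fn` are hypothetical, not Drill code):

```python
import time

class PluginMetadataCache:
    """Per-plugin schema cache with a TTL, refreshed early when a query
    references a column the cached schema does not know about.
    fetch_fn stands in for the remote metadata call (e.g. JDBC system
    tables); clock is injectable for testing."""

    def __init__(self, fetch_fn, ttl_seconds=300.0, clock=time.monotonic):
        self._fetch, self._ttl, self._clock = fetch_fn, ttl_seconds, clock
        self._schema, self._fetched_at = None, float("-inf")

    def _refresh(self):
        self._schema = self._fetch()
        self._fetched_at = self._clock()

    def schema(self):
        if self._clock() - self._fetched_at > self._ttl:
            self._refresh()                  # TTL expired: re-fetch
        return self._schema

    def resolve_column(self, column):
        if column not in self.schema():
            self._refresh()                  # maybe the column was just added
        return column in self._schema
```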
>
> For the classic, original use case (Parquet or CSV files on an HDFS-like
> system), the problem was the need to scan the directory structure at plan
> time to figure out which files to scan at run time. For Parquet, the
> planner also wants to do Parquet row group pruning, which requires reading
> the header of every one of the target files. Since this was slow, Drill
> would create a quick & dirty cache, but with large numbers of files, even
> reading that cache was slow (and, Drill would rebuild it any time a
> directory changed, which greatly slowed planning.)
>
> For that classic use case, saved plans never seemed a win because the
> "shape" of the query heavily depended on the WHERE clause: one clause might
> hit a small set of files, another hit a large set, and that then throws off
> join planning, hash/broadcast exchange decisions and so on.
>
> So, back to the suggestion to start with understanding where the time goes.
> Any silly stuff we can just stop doing? Is the cost due to external
> factors, such as those cited above? Or, is Calcite itself just heavy
> weight? Calcite is a rules engine. Add more rules or more nodes in the DAG,
> and the cost of planning rises steeply. So, are we fiddling about too much
> in the planning process?
>
> One way to test: use a mock data source and plan-time components to
> eliminate all external factors. Time various query shapes using EXPLAIN.
> How long does Calcite take? If a long time, then we've got a rather
> difficult problem as Calcite is hard to fix/replace.
>
> Then, time the plugins of interest. Figure out how to optimize those.
>
> My guess is that the bottleneck won't turn out to be what we think it is.
> It usually isn't.
>
> - Paul
>
> On Tue, Feb 8, 2022 at 8:19 AM Ted Dunning <te...@gmail.com> wrote:
>
> > James, you make some good points.
> >
> > I would generally support what you say except for one special case. I
> think
> > that there is a case to be made to be able to cache query plans in some
> > fashion.
> >
> > The traditional approach to do this is to use "prepared queries" by which
> > the application signals that it is willing to trust that a query plan
> will
> > continue to be correct for the duration of its execution. My experience
> > (and I think the industry's as well) is that the query plan is more
> stable
> > than the underlying details of the metadata and this level of caching (or
> > more) is a very good idea.
> >
> > In particular, the benefit to Drill is that we have a very expensive
> query
> > planning phase (I have seen numbers in the range 200-800ms routinely)
> but I
> > have seen execution times that are as low as a few 10's of ms. This
> > imbalance severely compromises the rate of concurrent querying for fast
> > queries. Having some form of plan caching would allow this planning
> > overhead to drop to zero in select cases.
> >
> > I have been unable to even consider working on this problem, but it seems
> > that one interesting heuristic would be based on two factors:
> > - the ratio of execution time to planning time
> > The rationale is that if a query takes much longer to run than to plan,
> we
> > might as well do planning each time. Conversely, if a query takes much
> less
> > time to run than it takes to plan, it is very important to avoid that
> > planning time.
> >
> > - the degree to which recent execution times seem inconsistent with
> longer
> > history
> > The rationale here is that a persistent drop in performance for a query
> is
> > a strong indicator that any cached plan is no longer valid and should be
> > updated. Conversely, if recent query history is consistent with long-term
> > history, that is a vote of confidence for the plan. Furthermore,
> depending
> > on how this is implemented the chance of a false positive change
> detection
> > may increase with very long histories which gives a desirable side effect
> > of occasional replanning.
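The two factors above might reduce to a decision function like the following sketch (the thresholds and names are invented for illustration, not a proposal for specific values):

```python
def should_reuse_plan(plan_ms, recent_exec_ms, history_exec_ms,
                      ratio_threshold=1.0, drift_factor=2.0):
    """Reuse a cached plan when (a) planning dominates execution, i.e.
    the execution/planning ratio is below the threshold, and (b) recent
    executions look consistent with long-term history; a persistent
    slowdown suggests the cached plan may no longer be valid."""
    recent = sum(recent_exec_ms) / len(recent_exec_ms)
    baseline = sum(history_exec_ms) / len(history_exec_ms)
    planning_dominates = recent / plan_ms < ratio_threshold
    looks_consistent = recent < drift_factor * baseline
    return planning_dominates and looks_consistent
```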
> >
> > What do other people think about this?
> >
> >
> >
> >
> > On Tue, Feb 8, 2022 at 4:48 AM James Turton <dz...@apache.org> wrote:
> >
> > > My 2c while I wait for 1.20.0 RC1 to upload.
> > >
> > > I think it's good that we continue to bring every design decision out
> > > here to the community like Charles did with this one.  Some relevant
> > > excerpts I turned up while zipping around a few ASF docs now.
> > >
> > >     "Mailing lists are the virtual rooms where ASF communities live,
> > >     form and grow. All formal decisions the project's PMC makes need to
> > >     have an email thread (possibly with a recorded vote) as an audit
> > >     trail that this was an official decision." [1]
> > >
> > >     "We firmly believe in hats
> > >     <https://www.apache.org/foundation/how-it-works.html#hats>. Your
> > >     role at the ASF is one assigned to you personally, and is bestowed
> > >     on you by your peers. It is not tied to your job or current
> employer
> > >     or company." [2]
> > >
> > >     "Unless they specifically state otherwise, whatever an ASF
> > >     participant posts on any mailing list is done /as themselves/. It
> is
> > >     the individual point-of-view, wearing their personal hat and not as
> > >     a mouthpiece for whatever company happens to be signing their
> > >     paychecks right now, and not even as a director of the ASF." [2]
> > >
> > >
> > > Info schema.  Info schema is slow when the set of enabled storage
> > > plugins is slow to register schemas.  Flaky plugins can be so slow to
> do
> > > this as to make info schema appear broken.  Info schema recently had
> its
> > > filter push down improved so that unneeded schema registration is
> > > avoidable [3], and I tested it working in the case of an unreachable
> > > active PostgreSQL plugin (provided my WHERE clause excluded said pg).
> > >

Re: [DISCUSS] Some ideas for Drill 1.21

Posted by Ted Dunning <te...@gmail.com>.
The planning time has been extensively analyzed.

It is inherent in a Volcano-style cost-based optimizer. This is a
branch-and-bound search of an exponential design space.

This bottleneck is very well understood.

Further, it has been accelerated under specialized conditions. As part of
OJAI, there was a limited form of Drill that was included that could work
on specific kinds of tables built into MapR FS. With some rather severe
truncations of the space that the optimizer had to search, the planning
time could be reduced to tens of milliseconds. That was fine for a limited
mission, but some of the really dramatic benefits of Drill on large
queries across complex domains would be impossible with that truncated rule
set.
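[Editor's note: the branch-and-bound search over an exponential plan space that Ted describes can be illustrated with a toy sketch. This is not Drill's or Calcite's planner code; the cardinalities and the flat 0.1% join selectivity are invented purely to show why the space blows up and how bounding prunes it.]

```python
from itertools import permutations

# Toy cardinalities; a flat 0.1% join selectivity is assumed throughout.
CARD = {"a": 1_000, "b": 50_000, "c": 200, "d": 10_000}

def branch_and_bound(tables):
    """Enumerate left-deep join orders, abandoning any order whose
    accumulated intermediate-result cost already exceeds the best
    complete plan found so far (the "bound" in branch and bound)."""
    best_cost, best_order, examined = float("inf"), None, 0
    for order in permutations(tables):      # n! candidates: exponential space
        examined += 1
        rows, cost = CARD[order[0]], 0
        for t in order[1:]:
            rows = rows * CARD[t] // 1000   # estimated join output rows
            cost += rows
            if cost >= best_cost:           # bound: prune this branch early
                break
        else:
            best_cost, best_order = cost, order
    return best_order, best_cost, examined

order, cost, examined = branch_and_bound(["a", "b", "c", "d"])
print(order, cost, examined)   # 4 tables already mean 24 candidate orders
```

Even with pruning, the candidate count grows factorially with the number of relations, which is why truncating the rule set (as in the OJAI case Ted mentions) can cut planning time so dramatically.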



On Wed, Feb 9, 2022 at 7:06 PM Paul Rogers <pa...@gmail.com> wrote:

> Hi All,
>
> Would be great to understand the source of the slow planning. Back in the
> day, I recall colleagues trying all kinds of things to speed up planning,
> but without the time to really figure out where the time went.
>
> I wonder if the two points are related. If most of that planning time is
> spent waiting for plugin metadata, then James' & Charles' issue could
> possibly be the cause of the slowness that Ted saw.
>
> James, it is still not clear what plugin metadata is being retrieved, and
> when. Now, it is hard to figure that out; that code is complex. Ideally, if
> you have a dozen plugins enabled, but query only one, then only that one
> should be doing anything. Further, if you're using an external system (like
> JDBC), the plugin should query the remote system tables only for the
> table(s) you hit in your query. If the code asks ALL plugins for
> information, or grabs all tables from the remote system, then, yeah, it's
> going to be slow.
>
> Adding per-plugin caching might make sense. For JDBC, say, it is not likely
> that the schema of the remote DB changes between queries, so caching for
> some amount of time is probably fine. And, if a query asks for an unknown
> column, the plugin could refresh metadata to see if the column was just
> added. (I was told that Impala users constantly had to run REFRESH METADATA
> to pick up new files added to HDFS.)
>
> For the classic, original use case (Parquet or CSV files on an HDFS-like
> system), the problem was the need to scan the directory structure at plan
> time to figure out which files to scan at run time. For Parquet, the
> planner also wants to do Parquet row group pruning, which requires reading
> the header of every one of the target files. Since this was slow, Drill
> would create a quick & dirty cache, but with large numbers of files, even
> reading that cache was slow (and, Drill would rebuild it any time a
> directory changed, which greatly slowed planning.)
>
> For that classic use case, saved plans never seemed a win because the
> "shape" of the query heavily depended on the WHERE clause: one clause might
> hit a small set of files, another hit a large set, and that then throws off
> join planning, hash/broadcast exchange decisions and so on.
>
> So, back to the suggestion to start with understanding where the time goes.
> Any silly stuff we can just stop doing? Is the cost due to external
> factors, such as those cited above? Or, is Calcite itself just
> heavyweight? Calcite is a rules engine. Add more rules or more nodes in the DAG,
> and the cost of planning rises steeply. So, are we fiddling about too much
> in the planning process?
>
> One way to test: use a mock data source and plan-time components to
> eliminate all external factors. Time various query shapes using EXPLAIN.
> How long does Calcite take? If a long time, then we've got a rather
> difficult problem as Calcite is hard to fix/replace.
>
> Then, time the plugins of interest. Figure out how to optimize those.
>
> My guess is that the bottleneck won't turn out to be what we think it is.
> It usually isn't.
>
> - Paul
>
> On Tue, Feb 8, 2022 at 8:19 AM Ted Dunning <te...@gmail.com> wrote:
>
> > James, you make some good points.
> >
> > I would generally support what you say except for one special case. I
> think
> > that there is a case to be made to be able to cache query plans in some
> > fashion.
> >
> > The traditional approach to do this is to use "prepared queries" by which
> > the application signals that it is willing to trust that a query plan
> will
> > continue to be correct for the duration of its execution. My experience
> > (and I think the industry's as well) is that the query plan is more
> stable
> > than the underlying details of the metadata and this level of caching (or
> > more) is a very good idea.
> >
> > In particular, the benefit to Drill is that we have a very expensive
> query
> > planning phase (I have seen numbers in the range 200-800ms routinely)
> but I
> > have seen execution times that are as low as a few tens of ms. This
> > imbalance severely compromises the rate of concurrent querying for fast
> > queries. Having some form of plan caching would allow this planning
> > overhead to drop to zero in select cases.
> >
> > I have been unable to even consider working on this problem, but it seems
> > that one interesting heuristic would be based on two factors:
> > - the ratio of execution time to planning time
> > The rationale is that if a query takes much longer to run than to plan,
> we
> > might as well do planning each time. Conversely, if a query takes much
> less
> > time to run than it takes to plan, it is very important to avoid that
> > planning time.
> >
> > - the degree to which recent execution times seem inconsistent with
> longer
> > history
> > The rationale here is that a persistent drop in performance for a query
> is
> > a strong indicator that any cached plan is no longer valid and should be
> > updated. Conversely, if recent query history is consistent with long-term
> > history, that is a vote of confidence for the plan. Furthermore,
> depending
> > on how this is implemented, the chance of a false positive change
> detection
> > may increase with very long histories which gives a desirable side effect
> > of occasional replanning.
> >
> > What do other people think about this?
> >
> >
> >
> >
> > On Tue, Feb 8, 2022 at 4:48 AM James Turton <dz...@apache.org> wrote:
> >
> > > My 2c while I wait for 1.20.0 RC1 to upload.
> > >
> > > I think it's good that we continue to bring every design decision out
> > > here to the community like Charles did with this one.  Some relevant
> > > excerpts I turned up while zipping around a few ASF docs now.
> > >
> > >     "Mailing lists are the virtual rooms where ASF communities live,
> > >     form and grow. All formal decisions the project's PMC makes need to
> > >     have an email thread (possibly with a recorded vote) as an audit
> > >     trail that this was an official decision." [1]
> > >
> > >     "We firmly believe in hats
> > >     <https://www.apache.org/foundation/how-it-works.html#hats>. Your
> > >     role at the ASF is one assigned to you personally, and is bestowed
> > >     on you by your peers. It is not tied to your job or current
> employer
> > >     or company." [2]
> > >
> > >     "Unless they specifically state otherwise, whatever an ASF
> > >     participant posts on any mailing list is done /as themselves/. It
> is
> > >     the individual point-of-view, wearing their personal hat and not as
> > >     a mouthpiece for whatever company happens to be signing their
> > >     paychecks right now, and not even as a director of the ASF." [2]
> > >
> > >
> > > Info schema.  Info schema is slow when the set of enabled storage
> > > plugins is slow to register schemas.  Flaky plugins can be so slow to
> do
> > > this as to make info schema appear broken.  Info schema recently had
> its
> > > filter push down improved so that unneeded schema registration is
> > > avoidable [3], and I tested it working in the case of an unreachable
> > > active PostgreSQL plugin (provided my WHERE clause excluded said pg).
> > >
> > > In my opinion making today's "on-demand" info schema, which re-fetches
> > > schema metadata from sources whenever a query requests it, more
> > > efficient is the right place to start.  Rewriting it on EVF2 would, I
> > > understand, gain it limit push down support for free, though filter
> push
> > > down seems more likely to be helpful on this kind of data to me.  There
> > > is also no reason I can see for info schema not to fetch schema
> metadata
> > > from plugins concurrently.  I don't know if this would be best achieved
> > > by explicit programming of the concurrency, or by making the info
> schema
> > > look "splittable" to Drill so that multiple fragments get created.
> > >
> > > Lastly, I'm generally against introducing any sort of results caching,
> > > data or metadata, except in special circumstances such as when the
> > > planner can be certain that the underlying data has not changed (seldom
> > > or never the case for Drill because it doesn't control its own storage
> > > layer).  I think that databases, reliable ones anyway, tend to shun
> > > results caching and push it to the application layer, since only that
> > > layer can decide what kind of staleness is acceptable, but correct me
> if
> > > I'm wrong.  My conclusion here is that I'd rather do this last, and
> only
> > > after careful consideration.
> > >
> > > [1] https://infra.apache.org/mailing-list-moderation.html
> > > [2] https://www.apache.org/foundation/how-it-works.html#management
> > > [3] https://github.com/apache/drill/pull/2388
> > >
> > > On 2022/02/07 21:05, Ted Dunning wrote:
> > > > Another option is to store metadata as data in a distributed data
> > store.
> > > > For static resources, that can scale very well. For highly dynamic
> > > > resources like conventional databases behind JDBC connections, you
> can
> > > > generally delegate metadata to that layer. Performance for delegated
> > > > metadata won't necessarily be great, but those systems are usually
> > either
> > > > small (like Postgres or MySQL) or fading away (like Hive).
> > > >
> > > > Focusing metadata and planning to a single node will make query
> > > concurrency
> > > > much worse (and it's already not good).
> > > >
> > > >
> > > > On Sun, Feb 6, 2022 at 6:28 PM Paul Rogers<pa...@gmail.com>
> wrote:
> > > >
> > > >> Hi All,
> > > >>
> > > >> Drill, like all open source projects, exists to serve those that use
> > > it. To
> > > >> that end, the best contributions come when some company needs a
> > feature
> > > >> badly enough that it is worth the effort to develop and contribute a
> > > >> solution. That's pretty standard, as long as the contribution is
> > > general
> > > >> purpose. In fact, I hope everyone using Drill in support of their
> > > company
> > > >> will contribute enhancements back to Drill. If you maintain your own
> > > >> private fork, you're not helping the community that provided you
> with
> > > the
> > > >> bulk of the code.
> > > >>
> > > >> For the info schema, I'm at a loss to guess why this would be slow,
> > > unless
> > > >> every plugin is going off and scanning some external source. Knowing
> > > that
> > > >> we have a dozen plugins is not slow. Looking at plugin configs is
> not
> > > slow.
> > > >> What could be slow is if you want to know about every possible file
> in
> > > HDFS
> > > >> or S3, every database and table in an external DB, etc. In this
> case,
> > > the
> > > >> bottleneck is either the external system, or the act of querying a
> > dozen
> > > >> different external systems. Perhaps, Charles, you can elaborate on
> the
> > > >> specific scenario you have in mind.
> > > >>
> > > >> Depending on the core issues, there are various solutions. One
> > solution
> > > is
> > > >> to cache all the external metadata in Drill. That's what Impala did
> > with
> > > >> the Hive Metastore, and it was a mess. I don't expect Drill would do
> > any
> > > >> better. One reason it was a mess is that, in a production
> > system,
> > > >> there is a vast amount of metadata. You end up playing all manner of
> > > tricks
> > > >> to try to compress it. Since Drill (and Impala) are fully symmetric,
> > > each
> > > >> node has to hold the entire cache. That is memory that can't be used
> > to
> > > run
> > > >> queries. So, to gain performance (for metadata) you give up
> > performance
> > > (at
> > > >> run time.)
> > > >>
> > > >> One solution is to create a separate metadata cache node. The query
> > > goes to
> > > >> some Drillbit that acts as Foreman. The Foreman plans the query and
> > > >> retrieves the needed metadata from the metadata node. The challenge
> > > here is
> > > >> that there will be a large amount of metadata transferred, and the
> > next
> > > >> thing we know we'll want to cache it in each Drillbit, putting us
> back
> > > >> where we started.
> > > >>
> > > >> So, one can go another step: shift all query planning to the
> metadata
> > > node
> > > >> and have a single planner node. The user connects to any Drillbit as
> > > >> Foreman, but that Foreman first talks to the "planner/metadata" node
> > to
> > > >> give it SQL and get back a plan. The Foreman then runs the plan as
> > > usual.
> > > >> (The Foreman runs the root fragment of the plan, which can be
> compute
> > > >> intensive, so we don't want the planner node to also act as the
> > > Foreman.)
> > > >> The notion here is that the SQL in/plan out is much smaller than the
> > > >> metadata that is needed to compute the plan.
> > > >>
> > > >> The idea about metadata has long been that Drill should provide a
> > > metadata
> > > >> API. The Drill metastore should be seen as just one of many metadata
> > > >> implementations. The Drill metastore is a "starter solution" for
> those
> > > who
> > > >> have not already invested in another solution. (Many shops have HMS
> or
> > > >> Amazon Glue, which is Amazon's version of HMS, or one of the newer
> > > >> metadata/catalog solutions.)
> > > >>
> > > >> One can go even further. Consider file and directory pruning in HMS.
> > > Every
> > > >> tool has to do the exact same thing: given a set of predicates, find
> > the
> > > >> directories and files that match. Impala does it. Spark must do it.
> > > >> Presto/Trino probably does it. Drill, when operating in Hive/HMS mode
> > > must
> > > >> do it. Maybe someone has come up with the One True Metadata Pruner and
> > > Drill
> > > >> can just delegate the task to that external tool, and get back the
> > list
> > > of
> > > >> directories and files to scan. Far better than building yet another
> > > pruner.
> > > >> (I think Drill currently has two Parquet metadata pruners,
> duplicating
> > > what
> > > >> many other tools have done.)
> > > >>
> > > >> If we see the source of metadata as pluggable, then a shop such as
> DDR
> > > that
> > > >> has specific needs (maybe caching those external schemas), can
> build a
> > > >> metadata plugin for that use case. If the solution is general, it
> can
> > be
> > > >> contributed to Drill as another metadata option.
> > > >>
> > > >> In any case, if we can better understand the specific problem you
> are
> > > >> encountering, we can perhaps offer more specific suggestions.
> > > >>
> > > >> Thanks,
> > > >>
> > > >> - Paul
> > > >>
> > > >> On Sun, Feb 6, 2022 at 8:11 AM Charles Givre<cg...@gmail.com>
> > wrote:
> > > >>
> > > >>> Hi Luoc,
> > > >>> Thanks for your concern.  Apache projects are often backed
> > unofficially
> > > >> by
> > > >>> a company.  Drill was, for years, backed by MapR, as evidenced by all
> > the
> > > >> MapR
> > > >>> unique code that is still in the Drill codebase. However, since
> > MapR's
> > > >>> acquisition, I think it is safe to say that Drill really has
> become a
> > > >>> community-driven project.  While some of the committers are
> > colleagues
> > > of
> > > >>> mine at DataDistillr, and Drill is a core part of DataDistillr,
> from
> > > our
> > > >>> perspective, we've really just been focusing on making Drill better
> > for
> > > >>> everyone as well as building the community of Drill users,
> regardless
> > > of
> > > >>> whether they use DataDistillr or not.  We haven't rejected any PRs
> > > >> because
> > > >>> they go against our business model or tried to steer Drill against
> > the
> > > >>> community or anything like that.
> > > >>>
> > > >>> Just for your awareness, there are other OSS projects, including
> some
> > > >>> Apache projects where one company controls everything.  Outside
> > > >>> contributions are only accepted if they fit the company's roadmap,
> > and
> > > >>> there is no real community-building that happens.  From my
> > perspective,
> > > >>> that is not what I want from Drill.  My personal goal is to build
> an
> > > >> active
> > > >>> community of users and developers around an awesome tool.
> > > >>>
> > > >>> I hope this answers your concerns.
> > > >>> Best,
> > > >>> -- C
> > > >>>
> > > >>>
> > > >>>> On Feb 6, 2022, at 9:42 AM, luoc<lu...@apache.org>  wrote:
> > > >>>>
> > > >>>>
> > > >>>> Before we discuss the next release, I would like to explain that an
> > > Apache
> > > >>> project should not be directly linked to a commercial company,
> > > otherwise
> > > >>> this will affect the motivation of the community to contribute.
> > > >>>> Thanks.
> > > >>>>
> > > >>>>> On Feb 6, 2022, at 21:29, Charles Givre<cg...@gmail.com>
> wrote:
> > > >>>>>
> > > >>>>> Hello all,
> > > >>>>> Firstly, I wanted to thank everyone for all the work that has
> gone
> > > >> into
> > > >>> Drill 1.20 as well as the ongoing discussion around Drill 2.0.   I
> > > wanted
> > > >>> to start a discussion around a topic for Drill 1.21, and that is
> > > INFO_SCHEMA
> > > >>> improvements.  As my company wades further and further into Drill,
> it
> > > has
> > > >>> become apparent that the INFO_SCHEMA could use some attention.
> James
> > > >>> Turton submitted a PR which was merged into Drill 1.20, but in so
> > doing
> > > >> he
> > > >>> uncovered an entire Pandora's box of other issues which might be
> > worth
> > > >>> addressing.  In a nutshell, the issues with the INFO_SCHEMA are all
> > > >>> performance related: it can be very slow and also can consume
> > > significant
> > > >>> resources when executing even basic queries.
> > > >>>>> My understanding of how the info schema (IS) works is that when a
> > > user
> > > >>> executes a query, Drill will attempt to instantiate every enabled
> > > storage
> > > >>> plugin to discover schemata and other information. As you might
> > > imagine,
> > > >>> this can be costly.
> > > >>>>> So, (and again, this is only meant as a conversation starter), I
> > was
> > > >>> thinking there are some general ideas as to how we might improve
> the
> > > IS:
> > > >>>>> 1.  Implement a limit pushdown:  As far as I can tell, there is
> no
> > > >>> limit pushdown in the IS and this could be a relatively quick win
> for
> > > >>> improving IS query performance.
> > > >>>>> 2.  Caching:  I understand that caching is tricky, but perhaps we
> > > >> could
> > > >>> add some sort of schema caching for IS queries, or make better use
> of
> > > the
> > > >>> Drill metastore to reduce the number of connections during IS
> > queries.
> > > >>> Perhaps in combination with the metastore, we could implement some
> > sort
> > > >> of
> > > >>> "metastore first" plan, whereby Drill first hits the metastore for
> > > query
> > > >>> results and if the limit is reached, we're done.  If not, query the
> > > >> storage
> > > >>> plugins...
> > > >>>>> 3.  Parallelization:  It did not appear to me that Drill
> > parallelizes
> > > >>> IS queries.   We may be able to add some parallelization which
> would
> > > >>> improve overall speed, but not necessarily reduce overall compute
> > cost
> > > >>>>> 4.  Convert to EVF2:  Not sure that there's a performance benefit
> > > >> here,
> > > >>> but at least we could get rid of cruft
> > > >>>>> 5.  Reduce SerDe:   I imagine there was a good reason for doing
> > this,
> > > >>> but the IS seems to obtain a POJO from the storage plugin then
> write
> > > >> these
> > > >>> results to old-school Drill vectors.  I'm sure there was a reason
> it
> > > was
> > > >>> done this way, (or maybe not) but I have to wonder if there is a
> more
> > > >>> efficient way of obtaining the information from the storage plugin,
> > > >> ideally
> > > >>> w/o all the object creation.
> > > >>>>> These are just some thoughts, and I'm curious as to what the
> > > community
> > > >>> thinks about this.  Thanks everyone!
> > > >>>>> -- C
> > > >>>
> > >
> >
>

Re: [DISCUSS] Some ideas for Drill 1.21

Posted by Paul Rogers <pa...@gmail.com>.
Hi All,

Would be great to understand the source of the slow planning. Back in the
day, I recall colleagues trying all kinds of things to speed up planning,
but without the time to really figure out where the time went.

I wonder if the two points are related. If most of that planning time is
spent waiting for plugin metadata, then James' & Charles' issue could
possibly be the cause of the slowness that Ted saw.

James, it is still not clear what plugin metadata is being retrieved, and
when. Now, it is hard to figure that out; that code is complex. Ideally, if
you have a dozen plugins enabled, but query only one, then only that one
should be doing anything. Further, if you're using an external system (like
JDBC), the plugin should query the remote system tables only for the
table(s) you hit in your query. If the code asks ALL plugins for
information, or grabs all tables from the remote system, then, yeah, it's
going to be slow.

Adding per-plugin caching might make sense. For JDBC, say, it is not likely
that the schema of the remote DB changes between queries, so caching for
some amount of time is probably fine. And, if a query asks for an unknown
column, the plugin could refresh metadata to see if the column was just
added. (I was told that Impala users constantly had to run REFRESH METADATA
to pick up new files added to HDFS.)
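[Editor's note: the per-plugin cache with refresh-on-unknown-column that Paul sketches could look like the following. This is a hypothetical design, not Drill's metadata code; `fetch_schema`, the TTL value, and the refresh-once policy are all assumptions.]

```python
import time

class SchemaCache:
    """Per-plugin schema cache: entries expire after a TTL, and a miss
    on an unknown column forces one early refresh, in case the column
    was just added on the remote system."""

    def __init__(self, fetch_schema, ttl_seconds=300):
        self.fetch_schema = fetch_schema   # e.g. queries remote system tables
        self.ttl = ttl_seconds
        self._cache = {}                   # table -> (fetched_at, {column, ...})

    def columns(self, table):
        entry = self._cache.get(table)
        if entry is None or time.monotonic() - entry[0] > self.ttl:
            entry = (time.monotonic(), set(self.fetch_schema(table)))
            self._cache[table] = entry
        return entry[1]

    def has_column(self, table, column):
        if column in self.columns(table):
            return True
        # Unknown column: maybe it was just added remotely, so refresh once.
        self._cache.pop(table, None)
        return column in self.columns(table)
```

The refresh-on-miss path is what avoids the Impala-style "run REFRESH METADATA by hand" problem for newly added columns, at the cost of one extra remote round trip on a genuine miss.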

For the classic, original use case (Parquet or CSV files on an HDFS-like
system), the problem was the need to scan the directory structure at plan
time to figure out which files to scan at run time. For Parquet, the
planner also wants to do Parquet row group pruning, which requires reading
the header of every one of the target files. Since this was slow, Drill
would create a quick & dirty cache, but with large numbers of files, even
reading that cache was slow (and, Drill would rebuild it any time a
directory changed, which greatly slowed planning.)

For that classic use case, saved plans never seemed a win because the
"shape" of the query heavily depended on the WHERE clause: one clause might
hit a small set of files, another hit a large set, and that then throws off
join planning, hash/broadcast exchange decisions and so on.

So, back to the suggestion to start with understanding where the time goes.
Any silly stuff we can just stop doing? Is the cost due to external
factors, such as those cited above? Or, is Calcite itself just
heavyweight? Calcite is a rules engine. Add more rules or more nodes in the DAG,
and the cost of planning rises steeply. So, are we fiddling about too much
in the planning process?

One way to test: use a mock data source and plan-time components to
eliminate all external factors. Time various query shapes using EXPLAIN.
How long does Calcite take? If a long time, then we've got a rather
difficult problem as Calcite is hard to fix/replace.
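[Editor's note: a minimal harness for the timing experiment Paul proposes might look like this. The `/query.json` REST endpoint is Drill's documented HTTP interface, but the host, auth, and query shapes here are placeholders; the timing helper itself works with any callable.]

```python
import json
import statistics
import time
import urllib.request

def drill_explain(sql, host="http://localhost:8047"):
    """Run EXPLAIN PLAN FOR <sql> through Drill's REST API
    (adjust host and authentication for your cluster)."""
    body = json.dumps({"queryType": "SQL",
                       "query": "EXPLAIN PLAN FOR " + sql}).encode()
    req = urllib.request.Request(host + "/query.json", data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def time_planning(run_explain, sql, repeats=5):
    """Time repeated planning of one query shape; returns min/median seconds."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_explain(sql)
        samples.append(time.perf_counter() - start)
    return {"min": min(samples), "median": statistics.median(samples)}

# Example against a live Drillbit (uncomment to run):
# print(time_planning(drill_explain, "SELECT * FROM cp.`employee.json`"))
```

Running this once against a mock storage plugin and once against each real plugin would separate Calcite's own cost from time spent waiting on external metadata, which is exactly the attribution Paul is asking for.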

Then, time the plugins of interest. Figure out how to optimize those.

My guess is that the bottleneck won't turn out to be what we think it is.
It usually isn't.

- Paul

On Tue, Feb 8, 2022 at 8:19 AM Ted Dunning <te...@gmail.com> wrote:

> James, you make some good points.
>
> I would generally support what you say except for one special case. I think
> that there is a case to be made to be able to cache query plans in some
> fashion.
>
> The traditional approach to do this is to use "prepared queries" by which
> the application signals that it is willing to trust that a query plan will
> continue to be correct for the duration of its execution. My experience
> (and I think the industry's as well) is that the query plan is more stable
> than the underlying details of the metadata and this level of caching (or
> more) is a very good idea.
>
> In particular, the benefit to Drill is that we have a very expensive query
> planning phase (I have seen numbers in the range 200-800ms routinely) but I
> have seen execution times that are as low as a few tens of ms. This
> imbalance severely compromises the rate of concurrent querying for fast
> queries. Having some form of plan caching would allow this planning
> overhead to drop to zero in select cases.
>
> I have been unable to even consider working on this problem, but it seems
> that one interesting heuristic would be based on two factors:
> - the ratio of execution time to planning time
> The rationale is that if a query takes much longer to run than to plan, we
> might as well do planning each time. Conversely, if a query takes much less
> time to run than it takes to plan, it is very important to avoid that
> planning time.
>
> - the degree to which recent execution times seem inconsistent with longer
> history
> The rationale here is that a persistent drop in performance for a query is
> a strong indicator that any cached plan is no longer valid and should be
> updated. Conversely, if recent query history is consistent with long-term
> history, that is a vote of confidence for the plan. Furthermore, depending
> on how this is implemented, the chance of a false positive change detection
> may increase with very long histories which gives a desirable side effect
> of occasional replanning.
>
> What do other people think about this?
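[Editor's note: Ted's two-factor heuristic for deciding when a cached plan is still trustworthy could be sketched as below. The cutoffs and the use of medians are arbitrary illustrative choices, not a proposal from the thread.]

```python
import statistics

def should_replan(plan_time, recent_exec_times, history_exec_times,
                  exec_plan_ratio_cutoff=5.0, drift_cutoff=2.0):
    """Hypothetical policy combining the two factors above:
    (1) replan freely when execution time dwarfs planning time, and
    (2) replan when recent runtimes drift away from long-term history."""
    hist_median = statistics.median(history_exec_times)
    # Factor 1: planning is cheap relative to running this query.
    if hist_median / plan_time >= exec_plan_ratio_cutoff:
        return True
    # Factor 2: recent executions look unlike the long-term history,
    # suggesting the cached plan may no longer match the data.
    recent_median = statistics.median(recent_exec_times)
    return recent_median / hist_median >= drift_cutoff

# Fast, stable query: keep the cached plan.
print(should_replan(0.5, [0.02, 0.03], [0.02, 0.03, 0.02]))  # False
# Same query suddenly 10x slower: replan.
print(should_replan(0.5, [0.3, 0.25], [0.02, 0.03, 0.02]))   # True
```

For the 200-800ms planning times and tens-of-ms execution times Ted cites, factor 1 would keep the cached plan (replanning is the expensive part), while factor 2 provides the staleness escape hatch.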
>
>
>
>
> On Tue, Feb 8, 2022 at 4:48 AM James Turton <dz...@apache.org> wrote:
>
> > My 2c while I wait for 1.20.0 RC1 to upload.
> >
> > I think it's good that we continue to bring every design decision out
> > here to the community like Charles did with this one.  Some relevant
> > excerpts I turned up while zipping around a few ASF docs now.
> >
> >     "Mailing lists are the virtual rooms where ASF communities live,
> >     form and grow. All formal decisions the project's PMC makes need to
> >     have an email thread (possibly with a recorded vote) as an audit
> >     trail that this was an official decision." [1]
> >
> >     "We firmly believe in hats
> >     <https://www.apache.org/foundation/how-it-works.html#hats>. Your
> >     role at the ASF is one assigned to you personally, and is bestowed
> >     on you by your peers. It is not tied to your job or current employer
> >     or company." [2]
> >
> >     "Unless they specifically state otherwise, whatever an ASF
> >     participant posts on any mailing list is done /as themselves/. It is
> >     the individual point-of-view, wearing their personal hat and not as
> >     a mouthpiece for whatever company happens to be signing their
> >     paychecks right now, and not even as a director of the ASF." [2]
> >
> >
> > Info schema.  Info schema is slow when the set of enabled storage
> > plugins is slow to register schemas.  Flaky plugins can be so slow to do
> > this as to make info schema appear broken.  Info schema recently had its
> > filter push down improved so that unneeded schema registration is
> > avoidable [3], and I tested it working in the case of an unreachable
> > active PostgreSQL plugin (provided my WHERE clause excluded said pg).
> >
> > In my opinion making today's "on-demand" info schema, which re-fetches
> > schema metadata from sources whenever a query requests it, more
> > efficient is the right place to start.  Rewriting it on EVF2 would, I
> > understand, gain it limit push down support for free, though filter push
> > down seems more likely to be helpful on this kind of data to me.  There
> > is also no reason I can see for info schema not to fetch schema metadata
> > from plugins concurrently.  I don't know if this would be best achieved
> > by explicit programming of the concurrency, or by making the info schema
> > look "splittable" to Drill so that multiple fragments get created.
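[Editor's note: the "explicit programming of the concurrency" option James mentions could be as simple as a thread pool fanning out over the enabled plugins. This is an illustrative sketch, not Drill's INFORMATION_SCHEMA code; `fetch_schema` stands in for whatever a plugin does to register its schemas.]

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def fetch_all_schemas(plugins, fetch_schema, timeout_s=10):
    """Fetch schema metadata from every enabled plugin concurrently, so
    one slow or flaky source delays, rather than serializes, the scan."""
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=len(plugins) or 1) as pool:
        futures = {pool.submit(fetch_schema, p): p for p in plugins}
        for fut in as_completed(futures, timeout=timeout_s):
            plugin = futures[fut]
            try:
                results[plugin] = fut.result()
            except Exception as exc:  # a broken plugin shouldn't sink the query
                errors[plugin] = exc
    return results, errors
```

The key property is the one James wants for the unreachable-PostgreSQL case: a plugin that hangs or throws shows up in `errors` (or times out) without blocking results from the healthy plugins.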
> >
> > Lastly, I'm generally against introducing any sort of results caching,
> > data or metadata, except in special circumstances such as when the
> > planner can be certain that the underlying data has not changed (seldom
> > or never the case for Drill because it doesn't control its own storage
> > layer).  I think that databases, reliable ones anyway, tend to shun
> > results caching and push it to the application layer, since only that
> > layer can decide what kind of staleness is acceptable, but correct me if
> > I'm wrong.  My conclusion here is that I'd rather do this last, and only
> > after careful consideration.
> >
> > [1] https://infra.apache.org/mailing-list-moderation.html
> > [2] https://www.apache.org/foundation/how-it-works.html#management
> > [3] https://github.com/apache/drill/pull/2388
> >
> > On 2022/02/07 21:05, Ted Dunning wrote:
> > > Another option is to store metadata as data in a distributed data
> store.
> > > For static resources, that can scale very well. For highly dynamic
> > > resources like conventional databases behind JDBC connections, you can
> > > generally delegate metadata to that layer. Performance for delegated
> > > metadata won't necessarily be great, but those systems are usually
> either
> > > small (like Postgres or MySQL) or fading away (like Hive).
> > >
> > > Focusing metadata and planning to a single node will make query
> > concurrency
> > > much worse (and it's already not good).
> > >
> > >
> > > On Sun, Feb 6, 2022 at 6:28 PM Paul Rogers<pa...@gmail.com>  wrote:
> > >
> > >> Hi All,
> > >>
> > >> Drill, like all open source projects, exists to serve those that use
> > it. To
> > >> that end, the best contributions come when some company needs a
> feature
> > >> badly enough that it is worth the effort to develop and contribute a
> > >> solution. That's pretty standard, as long as the contribution is
> > general
> > >> purpose. In fact, I hope everyone using Drill in support of their
> > company
> > >> will contribute enhancements back to Drill. If you maintain your own
> > >> private fork, you're not helping the community that provided you with
> > the
> > >> bulk of the code.
> > >>
> > >> For the info schema, I'm at a loss to guess why this would be slow,
> > unless
> > >> every plugin is going off and scanning some external source. Knowing
> > that
> > >> we have a dozen plugins is not slow. Looking at plugin configs is not
> > slow.
> > >> What could be slow is if you want to know about every possible file in
> > HDFS
> > >> or S3, every database and table in an external DB, etc. In this case,
> > the
> > >> bottleneck is either the external system, or the act of querying a
> dozen
> > >> different external systems. Perhaps, Charles, you can elaborate on the
> > >> specific scenario you have in mind.
> > >>
> > >> Depending on the core issues, there are various solutions. One
> solution
> > is
> > >> to cache all the external metadata in Drill. That's what Impala did
> with
> > >> the Hive Metastore, and it was a mess. I don't expect Drill would do
> any
> > >> better. One reason it was a mess is that, in a production
> system,
> > >> there is a vast amount of metadata. You end up playing all manner of
> > tricks
> > >> to try to compress it. Since Drill (and Impala) are fully symmetric,
> > each
> > >> node has to hold the entire cache. That is memory that can't be used
> to
> > run
> > >> queries. So, to gain performance (for metadata) you give up
> performance
> > (at
> > >> run time.)
> > >>
> > >> One solution is to create a separate metadata cache node. The query
> > goes to
> > >> some Drillbit that acts as Foreman. The Foreman plans the query and
> > >> retrieves the needed metadata from the metadata node. The challenge
> > here is
> > >> that there will be a large amount of metadata transferred, and the
> next
> > >> thing we know we'll want to cache it in each Drillbit, putting us back
> > >> where we started.
> > >>
> > >> So, one can go another step: shift all query planning to the metadata
> > node
> > >> and have a single planner node. The user connects to any Drillbit as
> > >> Foreman, but that Foreman first talks to the "planner/metadata" node
> to
> > >> give it SQL and get back a plan. The Foreman then runs the plan as
> > usual.
> > >> (The Foreman runs the root fragment of the plan, which can be compute
> > >> intensive, so we don't want the planner node to also act as the
> > Foreman.)
> > >> The notion here is that the SQL in/plan out is much smaller than the
> > >> metadata that is needed to compute the plan.
> > >>
> > >> The idea about metadata has long been that Drill should provide a
> > metadata
> > >> API. The Drill metastore should be seen as just one of many metadata
> > >> implementations. The Drill metastore is a "starter solution" for those
> > who
> > >> have not already invested in another solution. (Many shops have HMS or
> > >> Amazon Glue, which is Amazon's version of HMS, or one of the newer
> > >> metadata/catalog solutions.)
> > >>
> > >> One can go even further. Consider file and directory pruning in HMS.
> > Every
> > >> tool has to do the exact same thing: given a set of predicates, find
> the
> > >> directories and files that match. Impala does it. Spark must do it.
> > >> Presto/Trino probably does it. Drill, when operating in Hive/HMS mode
> > must
> > >> do it. Maybe someone has come up with the One True Metadata Pruner and
> > Drill
> > >> can just delegate the task to that external tool, and get back the
> list
> > of
> > >> directories and files to scan. Far better than building yet another
> > pruner.
> > >> (I think Drill currently has two Parquet metadata pruners, duplicating
> > what
> > >> many other tools have done.)
> > >>
> > >> If we see the source of metadata as pluggable, then a shop such as DDR
> > that
> > >> has specific needs (maybe caching those external schemas), can build a
> > >> metadata plugin for that use case. If the solution is general, it can
> be
> > >> contributed to Drill as another metadata option.
> > >>
> > >> In any case, if we can better understand the specific problem you are
> > >> encountering, we can perhaps offer more specific suggestions.
> > >>
> > >> Thanks,
> > >>
> > >> - Paul
> > >>
> > >> On Sun, Feb 6, 2022 at 8:11 AM Charles Givre<cg...@gmail.com>
> wrote:
> > >>
> > >>> Hi Luoc,
> > >>> Thanks for your concern.  Apache projects are often backed
> unofficially
> > >> by
> > >>> a company.  Drill was, for years, backed by MapR, as evidenced by all
> the
> > >> MapR
> > >>> unique code that is still in the Drill codebase. However, since
> MapR's
> > >>> acquisition, I think it is safe to say that Drill really has become a
> > >>> community-driven project.  While some of the committers are
> colleagues
> > of
> > >>> mine at DataDistillr, and Drill is a core part of DataDistillr, from
> > our
> > >>> perspective, we've really just been focusing on making Drill better
> for
> > >>> everyone as well as building the community of Drill users, regardless
> > of
> > >>> whether they use DataDistillr or not.  We haven't rejected any PRs
> > >> because
> > >>> they go against our business model or tried to steer Drill against
> the
> > >>> community or anything like that.
> > >>>
> > >>> Just for your awareness, there are other OSS projects, including some
> > >>> Apache projects where one company controls everything.  Outside
> > >>> contributions are only accepted if they fit the company's roadmap,
> and
> > >>> there is no real community-building that happens.  From my
> perspective,
> > >>> that is not what I want from Drill.  My personal goal is to build an
> > >> active
> > >>> community of users and developers around an awesome tool.
> > >>>
> > >>> I hope this answers your concerns.
> > >>> Best,
> > >>> -- C
> > >>>
> > >>>
> > >>>> On Feb 6, 2022, at 9:42 AM, luoc<lu...@apache.org>  wrote:
> > >>>>
> > >>>>
> > >>>> Before we discuss the next release, I would like to explain that an
> > Apache
> > >>> project should not be directly linked to a commercial company,
> > otherwise
> > >>> this will affect the motivation of the community to contribute.
> > >>>> Thanks.
> > >>>>
> > >>>>> On Feb 6, 2022, at 21:29, Charles Givre<cg...@gmail.com>  wrote:
> > >>>>>
> > >>>>> Hello all,
> > >>>>> Firstly, I wanted to thank everyone for all the work that has gone
> > >> into
> > >>> Drill 1.20 as well as the ongoing discussion around Drill 2.0.   I
> > wanted
> > >>> to start a discussion around a topic for Drill 1.21, namely
> > INFO_SCHEMA
> > >>> improvements.  As my company wades further and further into Drill, it
> > has
> > >>> become apparent that the INFO_SCHEMA could use some attention.  James
> > >>> Turton submitted a PR which was merged into Drill 1.20, but in so
> doing
> > >> he
> > >>> uncovered an entire Pandora's box of other issues which might be
> worth
> > >>> addressing.  In a nutshell, the issues with the INFO_SCHEMA are all
> > >>> performance related: it can be very slow and also can consume
> > significant
> > >>> resources when executing even basic queries.
> > >>>>> My understanding of how the info schema (IS) works is that when a
> > user
> > >>> executes a query, Drill will attempt to instantiate every enabled
> > storage
> > >>> plugin to discover schemata and other information. As you might
> > imagine,
> > >>> this can be costly.
> > >>>>> So, (and again, this is only meant as a conversation starter), I
> was
> > >>> thinking there are some general ideas as to how we might improve the
> > IS:
> > >>>>> 1.  Implement a limit pushdown:  As far as I can tell, there is no
> > >>> limit pushdown in the IS and this could be a relatively quick win for
> > >>> improving IS query performance.
> > >>>>> 2.  Caching:  I understand that caching is tricky, but perhaps we
> > >> could
> > >>> add some sort of schema caching for IS queries, or make better use of
> > the
> > >>> Drill metastore to reduce the number of connections during IS
> queries.
> > >>> Perhaps in combination with the metastore, we could implement some
> sort
> > >> of
> > >>> "metastore first" plan, whereby Drill first hits the metastore for
> > query
> > >>> results and if the limit is reached, we're done.  If not, query the
> > >> storage
> > >>> plugins...
> > >>>>> 3.  Parallelization:  It did not appear to me that Drill
> parallelizes
> > >>> IS queries.   We may be able to add some parallelization which would
> > >>> improve overall speed, but not necessarily reduce overall compute
> cost
> > >>>>> 4.  Convert to EVF2:  Not sure that there's a performance benefit
> > >> here,
> > >>> but at least we could get rid of cruft
> > >>>>> 5.  Reduce SerDe:   I imagine there was a good reason for doing
> this,
> > >>> but the IS seems to obtain a POJO from the storage plugin then write
> > >> these
> > >>> results to old-school Drill vectors.  I'm sure there was a reason it
> > was
> > >>> done this way, (or maybe not) but I have to wonder if there is a more
> > >>> efficient way of obtaining the information from the storage plugin,
> > >> ideally
> > >>> w/o all the object creation.
> > >>>>> These are just some thoughts, and I'm curious as to what the
> > community
> > >>> thinks about this.  Thanks everyone!
> > >>>>> -- C
> > >>>
> >
>

Re: [DISCUSS] Some ideas for Drill 1.21

Posted by Ted Dunning <te...@gmail.com>.
James, you make some good points.

I would generally support what you say except for one special case. I think
there is a case to be made for caching query plans in some
fashion.

The traditional approach to do this is to use "prepared queries" by which
the application signals that it is willing to trust that a query plan will
continue to be correct for the duration of its execution. My experience
(and I think the industry's as well) is that the query plan is more stable
than the underlying details of the metadata, and this level of caching (or
more) is a very good idea.

In particular, the benefit to Drill is that we have a very expensive query
planning phase (I have seen numbers in the range 200-800ms routinely) but I
have seen execution times that are as low as a few tens of ms. This
imbalance severely compromises the rate of concurrent querying for fast
queries. Having some form of plan caching would allow this planning
overhead to drop to zero in select cases.

I have not been able to even start on this problem, but it seems
that one interesting heuristic would be based on two factors:

- The ratio of execution time to planning time.
The rationale is that if a query takes much longer to run than to plan, we
might as well plan each time. Conversely, if a query takes much less
time to run than it takes to plan, it is very important to avoid that
planning time.

- The degree to which recent execution times seem inconsistent with
longer-term history.
The rationale here is that a persistent drop in performance for a query is
a strong indicator that any cached plan is no longer valid and should be
updated. Conversely, if recent query history is consistent with long-term
history, that is a vote of confidence for the plan. Furthermore, depending
on how this is implemented, the chance of a false-positive change detection
may increase with very long histories, which gives a desirable side effect
of occasional replanning.
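To make the idea concrete, here is a rough Java sketch of how those two factors might combine. Everything here is hypothetical (class and method names included); nothing like this exists in Drill today, and the thresholds would of course need tuning against real workloads.

```java
import java.util.List;

// Hypothetical sketch of the two-factor plan-cache heuristic described above.
public class PlanCacheHeuristic {

  // Factor 1: only bother caching when planning dominates execution.
  // A query that runs far longer than it plans gains little from a cached plan.
  static boolean cachingWorthwhile(double planningMs, double executionMs) {
    return executionMs < planningMs;
  }

  // Factor 2: invalidate when recent runs drift from long-term history.
  // A persistent slowdown suggests the cached plan no longer fits the data.
  static boolean planLooksStale(List<Double> longTermMs, List<Double> recentMs,
                                double driftFactor) {
    return mean(recentMs) > mean(longTermMs) * driftFactor;
  }

  static double mean(List<Double> xs) {
    return xs.stream().mapToDouble(Double::doubleValue).average().orElse(0.0);
  }

  public static void main(String[] args) {
    // 500 ms to plan, 20 ms to run: a prime candidate for plan caching.
    System.out.println(cachingWorthwhile(500, 20)); // true
    // Recent runs roughly 3x slower than history: time to replan.
    System.out.println(planLooksStale(
        List.of(20.0, 25.0, 22.0), List.of(70.0, 65.0), 2.0)); // true
  }
}
```

A real version would also need to bound the history it keeps, which, as noted above, conveniently doubles as a source of occasional replanning.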

What do other people think about this?




On Tue, Feb 8, 2022 at 4:48 AM James Turton <dz...@apache.org> wrote:

> My 2c while I wait for 1.20.0 RC1 to upload.
>
> I think it's good that we continue to bring every design decision out
> here to the community like Charles did with this one.  Some relevant
> excerpts I turned up while zipping around a few ASF docs just now.
>
>     "Mailing lists are the virtual rooms where ASF communities live,
>     form and grow. All formal decisions the project's PMC makes need to
>     have an email thread (possibly with a recorded vote) as an audit
>     trail that this was an official decision." [1]
>
>     "We firmly believe in hats
>     <https://www.apache.org/foundation/how-it-works.html#hats>. Your
>     role at the ASF is one assigned to you personally, and is bestowed
>     on you by your peers. It is not tied to your job or current employer
>     or company." [2]
>
>     "Unless they specifically state otherwise, whatever an ASF
>     participant posts on any mailing list is done /as themselves/. It is
>     the individual point-of-view, wearing their personal hat and not as
>     a mouthpiece for whatever company happens to be signing their
>     paychecks right now, and not even as a director of the ASF." [2]
>
>
> Info schema.  Info schema is slow when the set of enabled storage
> plugins is slow to register schemas.  Flaky plugins can be so slow to do
> this as to make info schema appear broken.  Info schema recently had its
> filter push down improved so that unneeded schema registration is
> avoidable [3], and I verified that it works in the case of an enabled
> but unreachable PostgreSQL plugin (provided my WHERE clause excluded said pg).
>
> In my opinion making today's "on-demand" info schema, which re-fetches
> schema metadata from sources whenever a query requests it, more
> efficient is the right place to start.  Rewriting it on EVF2 would, I
> understand, gain it limit push down support for free, though filter push
> down seems more likely to be helpful on this kind of data to me.  There
> is also no reason I can see for info schema not to fetch schema metadata
> from plugins concurrently.  I don't know if this would be best achieved
> by explicit programming of the concurrency, or by making the info schema
> look "splittable" to Drill so that multiple fragments get created.
>
> Lastly, I'm generally against introducing any sort of results caching,
> data or metadata, except in special circumstances such as when the
> planner can be certain that the underlying data has not changed (seldom
> or never the case for Drill because it doesn't control its own storage
> layer).  I think that databases, reliable ones anyway, tend to shun
> results caching and push it to the application layer, since only that
> layer can decide what kind of staleness is acceptable, but correct me if
> I'm wrong.  My conclusion here is that I'd rather do this last, and only
> after careful consideration.
>
> [1] https://infra.apache.org/mailing-list-moderation.html
> [2] https://www.apache.org/foundation/how-it-works.html#management
> [3] https://github.com/apache/drill/pull/2388
>
> On 2022/02/07 21:05, Ted Dunning wrote:
> > Another option is to store metadata as data in a distributed data store.
> > For static resources, that can scale very well. For highly dynamic
> > resources like conventional databases behind JDBC connections, you can
> > generally delegate metadata to that layer. Performance for delegated
> > metadata won't necessarily be great, but those systems are usually either
> > small (like Postgres or MySQL) or fading away (like Hive).
> >
> > Focusing metadata and planning to a single node will make query
> concurrency
> > much worse (and it's already not good).
> >
> >
> > On Sun, Feb 6, 2022 at 6:28 PM Paul Rogers<pa...@gmail.com>  wrote:
> >
> >> Hi All,
> >>
> >> Drill, like all open source projects, exists to serve those that use
> it. To
> >> that end, the best contributions come when some company needs a feature
> >> badly enough that it is worth the effort to develop and contribute a
> >> solution. That's pretty standard, as long as the contribution is
> general
> >> purpose. In fact, I hope everyone using Drill in support of their
> company
> >> will contribute enhancements back to Drill. If you maintain your own
> >> private fork, you're not helping the community that provided you with
> the
> >> bulk of the code.
> >>
> >> For the info schema, I'm at a loss to guess why this would be slow,
> unless
> >> every plugin is going off and scanning some external source. Knowing
> that
> >> we have a dozen plugins is not slow. Looking at plugin configs is not
> slow.
> >> What could be slow is if you want to know about every possible file in
> HDFS
> >> or S3, every database and table in an external DB, etc. In this case,
> the
> >> bottleneck is either the external system, or the act of querying a dozen
> >> different external systems. Perhaps, Charles, you can elaborate on the
> >> specific scenario you have in mind.
> >>
> >> Depending on the core issues, there are various solutions. One solution
> is
> >> to cache all the external metadata in Drill. That's what Impala did with
> >> the Hive Metastore, and it was a mess. I don't expect Drill would do any
> >> better. One reason it was a mess is that, in a production system,
> >> there is a vast amount of metadata. You end up playing all manner of
> tricks
> >> to try to compress it. Since Drill (and Impala) are fully symmetric,
> each
> >> node has to hold the entire cache. That is memory that can't be used to
> run
> >> queries. So, to gain performance (for metadata) you give up performance
> (at
> >> run time.)
> >>
> >> One solution is to create a separate metadata cache node. The query
> goes to
> >> some Drillbit that acts as Foreman. The Foreman plans the query and
> >> retrieves the needed metadata from the metadata node. The challenge
> here is
> >> that there will be a large amount of metadata transferred, and the next
> >> thing we know we'll want to cache it in each Drillbit, putting us back
> >> where we started.
> >>
> >> So, one can go another step: shift all query planning to the metadata
> node
> >> and have a single planner node. The user connects to any Drillbit as
> >> Foreman, but that Foreman first talks to the "planner/metadata" node to
> >> give it SQL and get back a plan. The Foreman then runs the plan as
> usual.
> >> (The Foreman runs the root fragment of the plan, which can be compute
> >> intensive, so we don't want the planner node to also act as the
> Foreman.)
> >> The notion here is that the SQL in/plan out is much smaller than the
> >> metadata that is needed to compute the plan.
> >>
> >> The idea about metadata has long been that Drill should provide a
> metadata
> >> API. The Drill metastore should be seen as just one of many metadata
> >> implementations. The Drill metastore is a "starter solution" for those
> who
> >> have not already invested in another solution. (Many shops have HMS or
> >> Amazon Glue, which is Amazon's version of HMS, or one of the newer
> >> metadata/catalog solutions.)
> >>
> >> One can go even further. Consider file and directory pruning in HMS.
> Every
> >> tool has to do the exact same thing: given a set of predicates, find the
> >> directories and files that match. Impala does it. Spark must do it.
> >> Presto/Trino probably does it. Drill, when operating in Hive/HMS mode
> must
> >> do it. Maybe someone has come up with the One True Metadata Pruner and
> Drill
> >> can just delegate the task to that external tool, and get back the list
> of
> >> directories and files to scan. Far better than building yet another
> pruner.
> >> (I think Drill currently has two Parquet metadata pruners, duplicating
> what
> >> many other tools have done.)
> >>
> >> If we see the source of metadata as pluggable, then a shop such as DDR
> that
> >> has specific needs (maybe caching those external schemas), can build a
> >> metadata plugin for that use case. If the solution is general, it can be
> >> contributed to Drill as another metadata option.
> >>
> >> In any case, if we can better understand the specific problem you are
> >> encountering, we can perhaps offer more specific suggestions.
> >>
> >> Thanks,
> >>
> >> - Paul
> >>
> >> On Sun, Feb 6, 2022 at 8:11 AM Charles Givre<cg...@gmail.com>  wrote:
> >>
> >>> Hi Luoc,
> >>> Thanks for your concern.  Apache projects are often backed unofficially
> >> by
> >>> a company.  Drill was, for years, backed by MapR, as evidenced by all the
> >> MapR
> >>> unique code that is still in the Drill codebase. However, since MapR's
> >>> acquisition, I think it is safe to say that Drill really has become a
> >>> community-driven project.  While some of the committers are colleagues
> of
> >>> mine at DataDistillr, and Drill is a core part of DataDistillr, from
> our
> >>> perspective, we've really just been focusing on making Drill better for
> >>> everyone as well as building the community of Drill users, regardless
> of
> >>> whether they use DataDistillr or not.  We haven't rejected any PRs
> >> because
> >>> they go against our business model or tried to steer Drill against the
> >>> community or anything like that.
> >>>
> >>> Just for your awareness, there are other OSS projects, including some
> >>> Apache projects where one company controls everything.  Outside
> >>> contributions are only accepted if they fit the company's roadmap, and
> >>> there is no real community-building that happens.  From my perspective,
> >>> that is not what I want from Drill.  My personal goal is to build an
> >> active
> >>> community of users and developers around an awesome tool.
> >>>
> >>> I hope this answers your concerns.
> >>> Best,
> >>> -- C
> >>>
> >>>
> >>>> On Feb 6, 2022, at 9:42 AM, luoc<lu...@apache.org>  wrote:
> >>>>
> >>>>
> >>>> Before we discuss the next release, I would like to explain that an
> Apache
> >>> project should not be directly linked to a commercial company,
> otherwise
> >>> this will affect the motivation of the community to contribute.
> >>>> Thanks.
> >>>>
> >>>>> On Feb 6, 2022, at 21:29, Charles Givre<cg...@gmail.com>  wrote:
> >>>>>
> >>>>> Hello all,
> >>>>> Firstly, I wanted to thank everyone for all the work that has gone
> >> into
> >>> Drill 1.20 as well as the ongoing discussion around Drill 2.0.   I
> wanted
> >>> to start a discussion around a topic for Drill 1.21, namely
> INFO_SCHEMA
> >>> improvements.  As my company wades further and further into Drill, it
> has
> >>> become apparent that the INFO_SCHEMA could use some attention.  James
> >>> Turton submitted a PR which was merged into Drill 1.20, but in so doing
> >> he
> >>> uncovered an entire Pandora's box of other issues which might be worth
> >>> addressing.  In a nutshell, the issues with the INFO_SCHEMA are all
> >>> performance related: it can be very slow and also can consume
> significant
> >>> resources when executing even basic queries.
> >>>>> My understanding of how the info schema (IS) works is that when a
> user
> >>> executes a query, Drill will attempt to instantiate every enabled
> storage
> >>> plugin to discover schemata and other information. As you might
> imagine,
> >>> this can be costly.
> >>>>> So, (and again, this is only meant as a conversation starter), I was
> >>> thinking there are some general ideas as to how we might improve the
> IS:
> >>>>> 1.  Implement a limit pushdown:  As far as I can tell, there is no
> >>> limit pushdown in the IS and this could be a relatively quick win for
> >>> improving IS query performance.
> >>>>> 2.  Caching:  I understand that caching is tricky, but perhaps we
> >> could
> >>> add some sort of schema caching for IS queries, or make better use of
> the
> >>> Drill metastore to reduce the number of connections during IS queries.
> >>> Perhaps in combination with the metastore, we could implement some sort
> >> of
> >>> "metastore first" plan, whereby Drill first hits the metastore for
> query
> >>> results and if the limit is reached, we're done.  If not, query the
> >> storage
> >>> plugins...
> >>>>> 3.  Parallelization:  It did not appear to me that Drill parallelizes
> >>> IS queries.   We may be able to add some parallelization which would
> >>> improve overall speed, but not necessarily reduce overall compute cost
> >>>>> 4.  Convert to EVF2:  Not sure that there's a performance benefit
> >> here,
> >>> but at least we could get rid of cruft
> >>>>> 5.  Reduce SerDe:   I imagine there was a good reason for doing this,
> >>> but the IS seems to obtain a POJO from the storage plugin then write
> >> these
> >>> results to old-school Drill vectors.  I'm sure there was a reason it
> was
> >>> done this way, (or maybe not) but I have to wonder if there is a more
> >>> efficient way of obtaining the information from the storage plugin,
> >> ideally
> >>> w/o all the object creation.
> >>>>> These are just some thoughts, and I'm curious as to what the
> community
> >>> thinks about this.  Thanks everyone!
> >>>>> -- C
> >>>
>

Re: [DISCUSS] Some ideas for Drill 1.21

Posted by James Turton <dz...@apache.org>.
My 2c while I wait for 1.20.0 RC1 to upload.

I think it's good that we continue to bring every design decision out 
here to the community like Charles did with this one.  Some relevant 
excerpts I turned up while zipping around a few ASF docs just now.

    "Mailing lists are the virtual rooms where ASF communities live,
    form and grow. All formal decisions the project's PMC makes need to
    have an email thread (possibly with a recorded vote) as an audit
    trail that this was an official decision." [1]

    "We firmly believe in hats
    <https://www.apache.org/foundation/how-it-works.html#hats>. Your
    role at the ASF is one assigned to you personally, and is bestowed
    on you by your peers. It is not tied to your job or current employer
    or company." [2]

    "Unless they specifically state otherwise, whatever an ASF
    participant posts on any mailing list is done /as themselves/. It is
    the individual point-of-view, wearing their personal hat and not as
    a mouthpiece for whatever company happens to be signing their
    paychecks right now, and not even as a director of the ASF." [2]


Info schema.  Info schema is slow when the set of enabled storage 
plugins is slow to register schemas.  Flaky plugins can be so slow to do 
this as to make info schema appear broken.  Info schema recently had its 
filter push down improved so that unneeded schema registration is 
avoidable [3], and I verified that it works in the case of an enabled
but unreachable PostgreSQL plugin (provided my WHERE clause excluded said pg).
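As a toy illustration of what that filter push down buys us (the names below are made up; the real logic lives in the planner rules touched by [3]): a predicate such as `WHERE TABLE_SCHEMA LIKE 'dfs%'` lets Drill skip schema registration entirely for every plugin that cannot match.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical illustration of info schema filter push down: given the
// enabled plugins and a pushed-down schema-name prefix, decide which
// plugins actually need to register their schemas.
public class InfoSchemaPruning {

  static List<String> pluginsToRegister(List<String> enabledPlugins,
                                        String schemaPrefix) {
    return enabledPlugins.stream()
        .filter(p -> p.startsWith(schemaPrefix))
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    List<String> enabled = List.of("dfs", "cp", "pg");
    // Only dfs-backed schemas are touched; the flaky "pg" plugin is never asked.
    System.out.println(pluginsToRegister(enabled, "dfs")); // [dfs]
  }
}
```

The point is that the pruning happens before any plugin is contacted, which is why an unreachable plugin excluded by the WHERE clause does not hurt the query.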

In my opinion making today's "on-demand" info schema, which re-fetches 
schema metadata from sources whenever a query requests it, more 
efficient is the right place to start.  Rewriting it on EVF2 would, I 
understand, gain it limit push down support for free, though filter push 
down seems more likely to be helpful on this kind of data to me.  There 
is also no reason I can see for info schema not to fetch schema metadata 
from plugins concurrently.  I don't know if this would be best achieved 
by explicit programming of the concurrency, or by making the info schema 
look "splittable" to Drill so that multiple fragments get created.
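For the explicit-programming route, a sketch might look like the following (all names hypothetical; the real implementation would live in the info schema record generators). A per-plugin timeout means one flaky source degrades to an empty contribution rather than stalling the whole query.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Function;

// Hypothetical sketch: fetch schema metadata from all enabled plugins in
// parallel, bounding how long any single plugin may hold up the query.
public class ConcurrentSchemaFetch {

  static Map<String, List<String>> fetchAll(List<String> plugins,
      Function<String, List<String>> fetcher, long timeoutMs) {
    ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, plugins.size()));
    try {
      Map<String, Future<List<String>>> futures = new ConcurrentHashMap<>();
      for (String p : plugins) {
        futures.put(p, pool.submit(() -> fetcher.apply(p)));
      }
      Map<String, List<String>> result = new ConcurrentHashMap<>();
      for (Map.Entry<String, Future<List<String>>> e : futures.entrySet()) {
        try {
          result.put(e.getKey(), e.getValue().get(timeoutMs, TimeUnit.MILLISECONDS));
        } catch (ExecutionException | TimeoutException ex) {
          // A slow or broken plugin contributes nothing instead of hanging the query.
        } catch (InterruptedException ex) {
          Thread.currentThread().interrupt();
          break;
        }
      }
      return result;
    } finally {
      pool.shutdownNow();
    }
  }

  public static void main(String[] args) {
    Function<String, List<String>> fetcher = p -> {
      if (p.equals("pg")) { // simulate an unreachable plugin
        try { Thread.sleep(5_000); } catch (InterruptedException ignored) { }
      }
      return List.of(p + ".schema");
    };
    // "pg" exceeds the 500 ms budget and is simply left out of the result.
    System.out.println(fetchAll(List.of("dfs", "cp", "pg"), fetcher, 500).keySet());
  }
}
```

The "splittable" alternative would get the same effect from Drill's own parallelization machinery, at the cost of making the info schema look like a distributed scan.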

Lastly, I'm generally against introducing any sort of results caching, 
data or metadata, except in special circumstances such as when the 
planner can be certain that the underlying data has not changed (seldom 
or never the case for Drill because it doesn't control its own storage 
layer).  I think that databases, reliable ones anyway, tend to shun 
results caching and push it to the application layer, since only that 
layer can decide what kind of staleness is acceptable, but correct me if 
I'm wrong.  My conclusion here is that I'd rather do this last, and only 
after careful consideration.

[1] https://infra.apache.org/mailing-list-moderation.html
[2] https://www.apache.org/foundation/how-it-works.html#management
[3] https://github.com/apache/drill/pull/2388

On 2022/02/07 21:05, Ted Dunning wrote:
> Another option is to store metadata as data in a distributed data store.
> For static resources, that can scale very well. For highly dynamic
> resources like conventional databases behind JDBC connections, you can
> generally delegate metadata to that layer. Performance for delegated
> metadata won't necessarily be great, but those systems are usually either
> small (like Postgres or MySQL) or fading away (like Hive).
>
> Focusing metadata and planning to a single node will make query concurrency
> much worse (and it's already not good).
>
>
> On Sun, Feb 6, 2022 at 6:28 PM Paul Rogers<pa...@gmail.com>  wrote:
>
>> Hi All,
>>
>> Drill, like all open source projects, exists to serve those that use it. To
>> that end, the best contributions come when some company needs a feature
>> badly enough that it is worth the effort to develop and contribute a
>> solution. That's pretty standard, as long as the contribution is general
>> purpose. In fact, I hope everyone using Drill in support of their company
>> will contribute enhancements back to Drill. If you maintain your own
>> private fork, you're not helping the community that provided you with the
>> bulk of the code.
>>
>> For the info schema, I'm at a loss to guess why this would be slow, unless
>> every plugin is going off and scanning some external source. Knowing that
>> we have a dozen plugins is not slow. Looking at plugin configs is not slow.
>> What could be slow is if you want to know about every possible file in HDFS
>> or S3, every database and table in an external DB, etc. In this case, the
>> bottleneck is either the external system, or the act of querying a dozen
>> different external systems. Perhaps, Charles, you can elaborate on the
>> specific scenario you have in mind.
>>
>> Depending on the core issues, there are various solutions. One solution is
>> to cache all the external metadata in Drill. That's what Impala did with
>> the Hive Metastore, and it was a mess. I don't expect Drill would do any
>> better. One reason it was a mess is that, in a production system,
>> there is a vast amount of metadata. You end up playing all manner of tricks
>> to try to compress it. Since Drill (and Impala) are fully symmetric, each
>> node has to hold the entire cache. That is memory that can't be used to run
>> queries. So, to gain performance (for metadata) you give up performance (at
>> run time.)
>>
>> One solution is to create a separate metadata cache node. The query goes to
>> some Drillbit that acts as Foreman. The Foreman plans the query and
>> retrieves the needed metadata from the metadata node. The challenge here is
>> that there will be a large amount of metadata transferred, and the next
>> thing we know we'll want to cache it in each Drillbit, putting us back
>> where we started.
>>
>> So, one can go another step: shift all query planning to the metadata node
>> and have a single planner node. The user connects to any Drillbit as
>> Foreman, but that Foreman first talks to the "planner/metadata" node to
>> give it SQL and get back a plan. The Foreman then runs the plan as usual.
>> (The Foreman runs the root fragment of the plan, which can be compute
>> intensive, so we don't want the planner node to also act as the Foreman.)
>> The notion here is that the SQL in/plan out is much smaller than the
>> metadata that is needed to compute the plan.
>>
>> The idea about metadata has long been that Drill should provide a metadata
>> API. The Drill metastore should be seen as just one of many metadata
>> implementations. The Drill metastore is a "starter solution" for those who
>> have not already invested in another solution. (Many shops have HMS or
>> Amazon Glue, which is Amazon's version of HMS, or one of the newer
>> metadata/catalog solutions.)
>>
>> One can go even further. Consider file and directory pruning in HMS. Every
>> tool has to do the exact same thing: given a set of predicates, find the
>> directories and files that match. Impala does it. Spark must do it.
>> Presto/Trino probably does it. Drill, when operating in Hive/HMS mode must
>> do it. Maybe someone has come up with the One True Metadata Pruner and Drill
>> can just delegate the task to that external tool, and get back the list of
>> directories and files to scan. Far better than building yet another pruner.
>> (I think Drill currently has two Parquet metadata pruners, duplicating what
>> many other tools have done.)
>>
>> If we see the source of metadata as pluggable, then a shop such as DDR that
>> has specific needs (maybe caching those external schemas), can build a
>> metadata plugin for that use case. If the solution is general, it can be
>> contributed to Drill as another metadata option.
>>
>> In any case, if we can better understand the specific problem you are
>> encountering, we can perhaps offer more specific suggestions.
>>
>> Thanks,
>>
>> - Paul
>>
>> On Sun, Feb 6, 2022 at 8:11 AM Charles Givre<cg...@gmail.com>  wrote:
>>
>>> Hi Luoc,
>>> Thanks for your concern.  Apache projects are often backed unofficially
>> by
>>> a company.  Drill was, for years, backed by MapR, as evidenced by all the
>> MapR
>>> unique code that is still in the Drill codebase. However, since MapR's
>>> acquisition, I think it is safe to say that Drill really has become a
>>> community-driven project.  While some of the committers are colleagues of
>>> mine at DataDistillr, and Drill is a core part of DataDistillr, from our
>>> perspective, we've really just been focusing on making Drill better for
>>> everyone as well as building the community of Drill users, regardless of
>>> whether they use DataDistillr or not.  We haven't rejected any PRs
>> because
>>> they go against our business model or tried to steer Drill against the
>>> community or anything like that.
>>>
>>> Just for your awareness, there are other OSS projects, including some
>>> Apache projects where one company controls everything.  Outside
>>> contributions are only accepted if they fit the company's roadmap, and
>>> there is no real community-building that happens.  From my perspective,
>>> that is not what I want from Drill.  My personal goal is to build an
>> active
>>> community of users and developers around an awesome tool.
>>>
>>> I hope this answers your concerns.
>>> Best,
>>> -- C
>>>
>>>
>>>> On Feb 6, 2022, at 9:42 AM, luoc<lu...@apache.org>  wrote:
>>>>
>>>>
>>>> Before we discuss the next release, I would like to explain that an Apache
>>> project should not be directly linked to a commercial company, otherwise
>>> this will affect the motivation of the community to contribute.
>>>> Thanks.
>>>>
>>>>> On Feb 6, 2022, at 21:29, Charles Givre<cg...@gmail.com>  wrote:
>>>>>
>>>>> Hello all,
>>>>> Firstly, I wanted to thank everyone for all the work that has gone
>> into
>>> Drill 1.20 as well as the ongoing discussion around Drill 2.0.   I wanted
>>> to start a discussion around a topic for Drill 1.21, namely INFO_SCHEMA
>>> improvements.  As my company wades further and further into Drill, it has
>>> become apparent that the INFO_SCHEMA could use some attention.  James
>>> Turton submitted a PR which was merged into Drill 1.20, but in so doing
>> he
>>> uncovered an entire Pandora's box of other issues which might be worth
>>> addressing.  In a nutshell, the issues with the INFO_SCHEMA are all
>>> performance related: it can be very slow and also can consume significant
>>> resources when executing even basic queries.
>>>>> My understanding of how the info schema (IS) works is that when a user
>>> executes a query, Drill will attempt to instantiate every enabled storage
>>> plugin to discover schemata and other information. As you might imagine,
>>> this can be costly.
>>>>> So, (and again, this is only meant as a conversation starter), I was
>>> thinking there are some general ideas as to how we might improve the IS:
>>>>> 1.  Implement a limit pushdown:  As far as I can tell, there is no
>>> limit pushdown in the IS and this could be a relatively quick win for
>>> improving IS query performance.
>>>>> 2.  Caching:  I understand that caching is tricky, but perhaps we
>> could
>>> add some sort of schema caching for IS queries, or make better use of the
>>> Drill metastore to reduce the number of connections during IS queries.
>>> Perhaps in combination with the metastore, we could implement some sort
>> of
>>> "metastore first" plan, whereby Drill first hits the metastore for query
>>> results and if the limit is reached, we're done.  If not, query the
>> storage
>>> plugins...
>>>>> 3.  Parallelization:  It did not appear to me that Drill parallelizes
>>> IS queries.   We may be able to add some parallelization which would
>>> improve overall speed, but not necessarily reduce overall compute cost
>>>>> 4.  Convert to EVF2:  Not sure that there's a performance benefit
>> here,
>>> but at least we could get rid of cruft
>>>>> 5.  Reduce SeDe:   I imagine there was a good reason for doing this,
>>> but the IS seems to obtain a POJO from the storage plugin then write
>> these
>>> results to old-school Drill vectors.  I'm sure there was a reason it was
>>> done this way, (or maybe not) but I have to wonder if there is a more
>>> efficient way of obtaining the information from the storage plugin,
>> ideally
>>> w/o all the object creation.
>>>>> These are just some thoughts, and I'm curious as to what the community
>>> thinks about this.  Thanks everyone!
>>>>> -- C
>>>

Re: [DISCUSS] Some ideas for Drill 1.21

Posted by Ted Dunning <te...@gmail.com>.
Another option is to store metadata as data in a distributed data store.
For static resources, that can scale very well. For highly dynamic
resources like conventional databases behind JDBC connections, you can
generally delegate metadata to that layer. Performance for delegated
metadata won't necessarily be great, but those systems are usually either
small (like Postgres or MySQL) or fading away (like Hive).
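To make the delegation idea concrete, here is a rough sketch in Python (sqlite3 from the stdlib stands in for the remote system; for a JDBC source, Drill would instead call DatabaseMetaData.getTables()):

```python
# Sketch: instead of instantiating a plugin and scanning the external system
# ourselves, ask the system for its own catalog. sqlite3 (stdlib) stands in
# here for any JDBC-accessible database.
import sqlite3

def delegated_tables(conn: sqlite3.Connection) -> list[str]:
    """Return the table names the engine itself reports."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
    return [name for (name,) in rows]

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE employees (id INTEGER, name TEXT)")
    conn.execute("CREATE TABLE orders (id INTEGER, total REAL)")
    print(delegated_tables(conn))  # ['employees', 'orders']
```

The point is that the backing store already maintains this catalog; an IS query over it costs one round trip rather than a scan.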

Concentrating metadata and planning on a single node will make query concurrency
much worse (and it's already not good).



Re: [DISCUSS] Some ideas for Drill 1.21

Posted by Paul Rogers <pa...@gmail.com>.
Hi All,

Drill, like all open source projects, exists to serve those that use it. To
that end, the best contributions come when some company needs a feature
badly enough that it is worth the effort to develop and contribute a
solution. That's pretty standard, as long as the contribution is general
purpose. In fact, I hope everyone using Drill in support of their company
will contribute enhancements back to Drill. If you maintain your own
private fork, you're not helping the community that provided you with the
bulk of the code.

For the info schema, I'm at a loss to guess why this would be slow, unless
every plugin is going off and scanning some external source. Knowing that
we have a dozen plugins is not slow. Looking at plugin configs is not slow.
What could be slow is if you want to know about every possible file in HDFS
or S3, every database and table in an external DB, etc. In this case, the
bottleneck is either the external system, or the act of querying a dozen
different external systems. Perhaps, Charles, you can elaborate on the
specific scenario you have in mind.

Depending on the core issues, there are various solutions. One solution is
to cache all the external metadata in Drill. That's what Impala did with
the Hive Metastore, and it was a mess. I don't expect Drill would do any
better. One reason it was a mess is that, in a production system,
there is a vast amount of metadata. You end up playing all manner of tricks
to try to compress it. Since Drill (and Impala) are fully symmetric, each
node has to hold the entire cache. That is memory that can't be used to run
queries. So, to gain performance (for metadata) you give up performance (at
run time.)

One solution is to create a separate metadata cache node. The query goes to
some Drillbit that acts as Foreman. The Foreman plans the query and
retrieves the needed metadata from the metadata node. The challenge here is
that there will be a large amount of metadata transferred, and the next
thing we know we'll want to cache it in each Drillbit, putting us back
where we started.

So, one can go another step: shift all query planning to the metadata node
and have a single planner node. The user connects to any Drillbit as
Foreman, but that Foreman first talks to the "planner/metadata" node to
give it SQL and get back a plan. The Foreman then runs the plan as usual.
(The Foreman runs the root fragment of the plan, which can be compute
intensive, so we don't want the planner node to also act as the Foreman.)
The notion here is that the SQL in/plan out is much smaller than the
metadata that is needed to compute the plan.

The idea about metadata has long been that Drill should provide a metadata
API. The Drill metastore should be seen as just one of many metadata
implementations. The Drill metastore is a "starter solution" for those who
have not already invested in another solution. (Many shops have HMS or
Amazon Glue, which is Amazon's version of HMS, or one of the newer
metadata/catalog solutions.)
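Roughly, the shape of the interface I have in mind, sketched in Python for brevity (the class and method names are hypothetical, not Drill's actual API):

```python
# Hypothetical pluggable metadata API: the planner codes against a small
# interface, and the Drill metastore, HMS, Glue, or a site-specific cache can
# each sit behind it. None of these names exist in Drill today.
from abc import ABC, abstractmethod

class MetadataProvider(ABC):
    @abstractmethod
    def tables(self, schema: str) -> list[str]:
        """Return the table names known for a schema."""

class InMemoryMetastore(MetadataProvider):
    """Stand-in for the built-in Drill metastore."""
    def __init__(self, catalog: dict[str, list[str]]):
        self.catalog = catalog

    def tables(self, schema: str) -> list[str]:
        return sorted(self.catalog.get(schema, []))

def plan_time_tables(provider: MetadataProvider, schema: str) -> list[str]:
    # The planner never knows which backend answered; swapping in an HMS- or
    # Glue-backed provider requires no planner changes.
    return provider.tables(schema)

store = InMemoryMetastore({"dfs.sales": ["orders", "customers"]})
print(plan_time_tables(store, "dfs.sales"))  # ['customers', 'orders']
```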

One can go even further. Consider file and directory pruning in HMS. Every
tool has to do the exact same thing: given a set of predicates, find the
directories and files that match. Impala does it. Spark must do it.
Presto/Trino probably does it. Drill, when operating in Hive/HMS mode must
do it. Maybe someone has come up with the One True Metadata Pruner and Drill
can just delegate the task to that external tool, and get back the list of
directories and files to scan. Far better than building yet another pruner.
(I think Drill currently has two Parquet metadata pruners, duplicating what
many other tools have done.)
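A toy version of that shared pruning task, to show how little engine-specific logic it really needs (the Hive-style directory layout and the key=value predicate encoding are simplified assumptions):

```python
# Sketch: given partition-key predicates, keep only the directories whose
# encoded key values satisfy all of them.

def prune(dirs: list[str], predicates: dict[str, str]) -> list[str]:
    """Keep dirs like 'sales/year=2022/month=02' matching every predicate."""
    kept = []
    for d in dirs:
        # Parse Hive-style path segments into a {key: value} dict.
        parts = dict(p.split("=", 1) for p in d.split("/") if "=" in p)
        if all(parts.get(k) == v for k, v in predicates.items()):
            kept.append(d)
    return kept

dirs = [
    "sales/year=2021/month=12",
    "sales/year=2022/month=01",
    "sales/year=2022/month=02",
]
print(prune(dirs, {"year": "2022"}))
# ['sales/year=2022/month=01', 'sales/year=2022/month=02']
```

An external pruner would do exactly this, only against real catalog metadata and with a real expression language, and every engine could share it.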

If we see the source of metadata as pluggable, then a shop such as DDR that
has specific needs (maybe caching those external schemas), can build a
metadata plugin for that use case. If the solution is general, it can be
contributed to Drill as another metadata option.

In any case, if we can better understand the specific problem you are
encountering, we can perhaps offer more specific suggestions.

Thanks,

- Paul


Re: [DISCUSS] Some ideas for Drill 1.21

Posted by Charles Givre <cg...@gmail.com>.
Hi Luoc, 
Thanks for your concern.  Apache projects are often backed unofficially by a company.  Drill was, for years, backed by MapR, as evidenced by all the MapR-specific code that is still in the Drill codebase. However, since MapR's acquisition, I think it is safe to say that Drill has really become a community-driven project.  While some of the committers are colleagues of mine at DataDistillr, and Drill is a core part of DataDistillr, from our perspective we've really just been focusing on making Drill better for everyone, as well as building the community of Drill users, regardless of whether they use DataDistillr or not.  We haven't rejected any PRs because they go against our business model, or tried to steer Drill against the community, or anything like that. 

Just for your awareness, there are other OSS projects, including some Apache projects where one company controls everything.  Outside contributions are only accepted if they fit the company's roadmap, and there is no real community-building that happens.  From my perspective, that is not what I want from Drill.  My personal goal is to build an active community of users and developers around an awesome tool. 

I hope this answers your concerns.
Best,
-- C




Re: [DISCUSS] Some ideas for Drill 1.21

Posted by luoc <lu...@apache.org>.
Before we discuss the next release, I would like to point out that an Apache project should not be directly tied to a commercial company, as this could dampen the community's motivation to contribute.

Thanks.

> On Feb 6, 2022, at 21:29, Charles Givre <cg...@gmail.com> wrote:
> 
> Hello all, 
> Firstly, I wanted to thank everyone for all the work that has gone into Drill 1.20, as well as the ongoing discussion around Drill 2.0.   I wanted to start a discussion around a topic for Drill 1.21, namely INFO_SCHEMA improvements.  As my company wades further and further into Drill, it has become apparent that the INFO_SCHEMA could use some attention.  James Turton submitted a PR which was merged into Drill 1.20, but in so doing he uncovered an entire Pandora's box of other issues which might be worth addressing.  In a nutshell, the issues with the INFO_SCHEMA are all performance related: it can be very slow and can consume significant resources when executing even basic queries.  
> 
> My understanding of how the info schema (IS) works is that when a user executes a query, Drill will attempt to instantiate every enabled storage plugin to discover schemata and other information. As you might imagine, this can be costly. 
> 
> So (and again, this is only meant as a conversation starter), here are some general ideas as to how we might improve the IS:
> 1.  Implement a limit pushdown:  As far as I can tell, there is no limit pushdown in the IS and this could be a relatively quick win for improving IS query performance.
> 2.  Caching:  I understand that caching is tricky, but perhaps we could add some sort of schema caching for IS queries, or make better use of the Drill metastore to reduce the number of connections during IS queries.  Perhaps in combination with the metastore, we could implement some sort of "metastore first" plan, whereby Drill first hits the metastore for query results and if the limit is reached, we're done.  If not, query the storage plugins...
> 3.  Parallelization:  It did not appear to me that Drill parallelizes IS queries.   We may be able to add some parallelization which would improve overall speed, but not necessarily reduce overall compute cost
> 4.  Convert to EVF2:  Not sure that there's a performance benefit here, but at least we could get rid of cruft
> 5.  Reduce SerDe:   The IS seems to obtain POJOs from the storage plugin and then write the results to old-school Drill vectors.  There may have been a good reason it was done this way (or maybe not), but I have to wonder if there is a more efficient way of obtaining the information from the storage plugin, ideally w/o all the object creation. 
> 
> These are just some thoughts, and I'm curious as to what the community thinks about this.  Thanks everyone!
> -- C