Posted to dev@drill.apache.org by Miguel Branco <mi...@epfl.ch> on 2013/03/11 10:17:02 UTC

Questions on architecture / plans

Hi,

I'm trying to understand how some parts of the architecture connect, and
wondering whether some of the work we've done could be contributed (it's
on fast "scan" operators: you actually reference it under "NoDB"
in your wiki on research/academic papers). But first we'd need to
understand how the various Drill pieces fit together.

I understand this is an open project, and given stable APIs, we can have 
many implementations of any given part. But let's assume I'd like to 
implement a storage engine plus some scan/filter operator. There's the 
reference interpreter, which does the runtime execution and has its own 
reference operators. The goal there is not performance but just to 
demonstrate/prove functionality. But how do you see that moving forward? 
I've read early references to code generation & other
runtime executors; is anyone already working on those, or
is it still far too early? Would that be done by effectively
"replicating" all the functionality currently being built in the
reference interpreter, or are there plans to take parts of that out &
reuse them? I.e., are there plans for some common plan interpreter code?

Regarding the linking with Optiq: I understand that's work in progress, 
but it's not quite clear to me how certain capabilities are advertised 
between Drill & Optiq: for instance, a storage engine plus scan 
operators have to be known to both systems. Will there be a need to 
advertise those separately to both systems, or will Optiq be embedded in 
Drill in some seamless way? Similarly, it's not quite clear to me how 
one implements a runtime for Drill, considering there's Optiq doing the 
query optimization, and both systems have to know what operators exist 
to optimize them/execute them. Or, to put it differently, how tightly 
coupled will Optiq and Drill be? [And, are there any other plans for
query optimization besides Optiq? Not that there's any issue with Optiq
- I'm learning about it and am very impressed with it - but if
there are alternative optimizers in the pipeline, then these issues have
probably been addressed or will be soon.]

Any hints / description on your current thoughts would be most welcome. 
In fact, considering the project is now a few months old, it would be 
great to see where it's going *now*. It would allow us to understand
where our work might fit in, which components (e.g. Drill, Optiq) it
has to link to, and where we can help!

many thanks,
Miguel

Re: Questions on architecture / plans

Posted by Jacques Nadeau <ja...@apache.org>.

Hi Miguel,

So glad you dropped by.  Here are some thoughts in no particular order.

   - It's a great paper!
   - Drill is in progress.  We're focused on having some strong API tiers
   and generating first implementations that solve each tier's problems.
   The key APIs are going to be the Storage Engine API [1], the logical
   plan [2] and the physical plan [3].  These correspond to tiers of parser,
   planner, execution engine, and storage engine implementations.  If someone
   from the community is very interested in one or more particular tiers, we
   would love to have them involved.
   - The first goal is to use Optiq as a SQL parser for Logical Plan
   generation.  We've talked about also using it as an optimizer for Drill but
   I think of that as a separate and still TBD goal since the two phases
   are separated by a strong API.  If it is to be leveraged for that, it would
   need to be extended to be comfortable with nested and schema-late data, as
   well as distributed query planning and approximate tree-based query plans.
   - Just a few days ago I was talking to someone about using a fork of the
   Postgres codebase for a Drill optimizer as that seems to have worked well
   with others.
   - We haven't done any work yet on the portions of the storage engine
   APIs available to each tier.  Clearly some functions need to be available.
   Someone just volunteered to work on DRILL-15, so hopefully some things will
   progress there and you could help influence that design.
   - The Reference Interpreter is really a prototype for dataflow.  It is
   unlikely that much of the reference dataflow code will be
   leveraged in the first execution engine.  Metadata is more up in the air.
   Some of the basic metadata and storage engine concepts there will likely be
   transferred over to the first execution engine.
   - The first execution engine will probably generate
   code for things like record access and scalar expressions (likely first in
   Java).
   - I'm currently at work on portions of the first execution engine but
   have little to show for that, so far.
   - It looks like you guys have solved some interesting problems.  This
   project is all about community and trying to leverage as much work as
   possible.  It would be great if you could be involved.
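
As a rough illustration of the tiering described above, here is a minimal
sketch of a logical plan being bound to a storage engine to produce something
executable.  It is not Drill code: every class and function name below
(StorageEngine, InMemoryEngine, LogicalScan, LogicalFilter, to_physical) is
hypothetical, the "physical plan" is reduced to a plain callable, and it is
written in Python purely for brevity.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterator, List

Record = Dict[str, object]


class StorageEngine:
    """Hypothetical storage-engine tier: knows how to produce records."""

    def scan(self, table: str) -> Iterator[Record]:
        raise NotImplementedError


class InMemoryEngine(StorageEngine):
    """Toy engine backed by Python lists, standing in for a real data source."""

    def __init__(self, tables: Dict[str, List[Record]]):
        self._tables = tables

    def scan(self, table: str) -> Iterator[Record]:
        return iter(self._tables[table])


@dataclass
class LogicalScan:
    """Logical operator: read every record of a named table."""
    table: str


@dataclass
class LogicalFilter:
    """Logical operator: keep only the child's records that match a predicate."""
    child: object
    predicate: Callable[[Record], bool]


def to_physical(node, engine: StorageEngine) -> Callable[[], Iterator[Record]]:
    """Planner tier (assumed): bind a logical plan to one engine.

    Returns a zero-argument callable that yields records when invoked.
    """
    if isinstance(node, LogicalScan):
        return lambda: engine.scan(node.table)
    if isinstance(node, LogicalFilter):
        child = to_physical(node.child, engine)
        return lambda: (r for r in child() if node.predicate(r))
    raise TypeError(f"unknown logical operator: {node!r}")


# Made-up sample data, only there to exercise the sketch.
engine = InMemoryEngine(
    {"donuts": [{"name": "plain", "price": 1.0}, {"name": "glazed", "price": 2.5}]}
)
plan = LogicalFilter(LogicalScan("donuts"), lambda r: r["price"] > 2.0)
result = list(to_physical(plan, engine)())
print(result)
```

In this sketch the logical operators never touch the engine directly, so a
different StorageEngine subclass (for example, one wrapping a fast scan
operator) could be swapped in without changing the plan.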

We can also chat over Skype etc.

Thanks for dropping by,
Jacques


[1] https://issues.apache.org/jira/browse/DRILL-13 (design needed)
[2] https://docs.google.com/a/maprtech.com/document/d/1QTL8warUYS2KjldQrGUse7zp8eA72VKtLOHwfXy6c7I/edit
[3] https://issues.apache.org/jira/browse/DRILL-17 (in progress)

