You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@drill.apache.org by David Alves <da...@gmail.com> on 2013/03/11 11:02:04 UTC

contribution

Hi All

	I have a new academic project for which I'd like to use drill since none of the other parallel database over hadoop/nosql implementations fit just right.
	To this goal I've been tinkering with the prototype trying to find where I'd be most useful.

	Here's where I'd like to start, if you agree:
	- implement HBase storage engine (DRILL-15)
		- start with simple scanning an push down of selection/projection
	- implement the LogicalPlanBuilder (DRILL-45)
	- setup coding style in the wiki (formatting/imports etc, DRILL-46)
	- create builders for all logical plan elements/make logical plans immutable (no issue for this, I'd like to hear your thoughts first).

	Please let me know your thoughts, and if you agree please assign the issues to me (it seems that I can't assign them myself).

Best
David Alves

Re: contribution

Posted by Ted Yu <yu...@gmail.com>.

David:
It is so nice of you to work on DRILL-15

Thanks

On Mon, Mar 11, 2013 at 3:02 AM, David Alves <da...@gmail.com> wrote:

> Hi All
>
>         I have a new academic project for which I'd like to use drill
> since none of the other parallel database over hadoop/nosql implementations
> fit just right.
>         To this goal I've been tinkering with the prototype trying to find
> where I'd be most useful.
>
>         Here's where I'd like to start, if you agree:
>         - implement HBase storage engine (DRILL-15)
>                 - start with simple scanning an push down of
> selection/projection
>         - implement the LogicalPlanBuilder (DRILL-45)
>         - setup coding style in the wiki (formatting/imports etc, DRILL-46)
>         - create builders for all logical plan elements/make logical plans
> immutable (no issue for this, I'd like to hear your thoughts first).
>
>         Please let me know your thoughts, and if you agree please assign
> the issues to me (it seems that I can't assign them myself).
>
> Best
> David Alves

Re: contribution

Posted by Ted Dunning <te...@gmail.com>.

David,

These all look fabulous.

Be very careful, however, if you establish a history of contributing, you
might accidentally be nominated as a committer!

On Mon, Mar 11, 2013 at 3:02 AM, David Alves <da...@gmail.com> wrote:

> Hi All
>
>         I have a new academic project for which I'd like to use drill
> since none of the other parallel database over hadoop/nosql implementations
> fit just right.
>         To this goal I've been tinkering with the prototype trying to find
> where I'd be most useful.
>
>         Here's where I'd like to start, if you agree:
>         - implement HBase storage engine (DRILL-15)
>                 - start with simple scanning an push down of
> selection/projection
>         - implement the LogicalPlanBuilder (DRILL-45)
>         - setup coding style in the wiki (formatting/imports etc, DRILL-46)
>         - create builders for all logical plan elements/make logical plans
> immutable (no issue for this, I'd like to hear your thoughts first).
>
>         Please let me know your thoughts, and if you agree please assign
> the issues to me (it seems that I can't assign them myself).
>
> Best
> David Alves

Re: contribution

Posted by Jacques Nadeau <ja...@apache.org>.

Not yet.  I will share as soon as I get something cohesive together.

Thanks,
Jacques

On Fri, Mar 22, 2013 at 12:06 PM, David Alves <da...@gmail.com> wrote:

> Hey Jacques
>
>         Sorry to be a nag, but is there any change to take a sneak peak at
> the protobuf rpc stuff?
>         I'd really like hack something together wrt to the daemon this
> weekend.
>         Also, wrt to configuration management (zk/helix) maybe you could
> post the iface so that it'd be possible to hack something static (i.e.
> non-ft, properties file based) just to make dist execution work.
>
> Thanks
> David
>
> On Mar 16, 2013, at 8:34 PM, Jacques Nadeau <ja...@apache.org> wrote:
>
> > Hey David,
> >
> > The java-exec framework is not far enough along that it makes sense for
> me
> > to push it externally yet.  However, I did push my initial wip physical
> > plan approach.  You can find it here:
> > https://github.com/jacques-n/incubator-drill/tree/physical_plan_updates
> >
> > Hopefully, I will get further along on the java-exec stuff soon.
> >
> > I'd suggest that you focus your energy on the StorageEngine API and HBase
> > implementation.  If you're up for it, let's do a quick skype chat to sync
> > up.  Let me know your availability over the next few days.
> >
> > Thanks,
> > Jacques
> >
> >
> >
> > On Fri, Mar 15, 2013 at 6:59 PM, David Alves <da...@gmail.com>
> wrote:
> >
> >> that'd be great thanks.
> >>
> >> -david
> >>
> >> On Mar 15, 2013, at 8:51 PM, Jacques Nadeau <ja...@gmail.com>
> >> wrote:
> >>
> >>> I've been under the weather the last few days and haven't made much
> >>> progress. Let me see if I can get you something tomorrow.
> >>>
> >>> On Mar 15, 2013, at 2:36 PM, David Alves <da...@gmail.com>
> wrote:
> >>>
> >>>> Hi Jacques
> >>>>
> >>>>  Is there any chance we could get a preview of this physical plan
> >> stuff and basic plumbing for distributed execution before the weekend?
> >> maybe in a github branch somewhere?
> >>>>  I mean it doesn't have to be complete or even running, I'd just like
> >> to make some progress with other stuff and keeping it in line with
> >> whichever plumbing you already have would be great.
> >>>>
> >>>> Best
> >>>> David
> >>>>
> >>>> On Mar 13, 2013, at 3:12 PM, Jacques Nadeau <ja...@apache.org>
> wrote:
> >>>>
> >>>>> I'm working on some physical plan stuff as well as some basic
> plumbing
> >> for
> >>>>> distributed execution.  Its very in progress so I need to clean
> things
> >> up a
> >>>>> bit before we could collaborate/ divide and conquer on it.  Depending
> >> on
> >>>>> your timing and availability, maybe I could put some of this together
> >> in
> >>>>> the next couple days so that you could plug in rather than reinvent.
> >> In
> >>>>> the meantime, pushing forward the builder stuff, additional test
> cases
> >> on
> >>>>> the reference interpreter and/or thinking through the logical plan
> >> storage
> >>>>> engine pushdown/rewrite could be very useful.
> >>>>>
> >>>>> Let me know your thoughts.
> >>>>>
> >>>>> thanks,
> >>>>> Jacques
> >>>>>
> >>>>> On Wed, Mar 13, 2013 at 9:47 AM, David Alves <da...@gmail.com>
> >> wrote:
> >>>>>
> >>>>>> Hi Jacques
> >>>>>>
> >>>>>>     I can assign issues to me now, thanks.
> >>>>>>     What you say wrt to the logical/physical/execution layers sounds
> >>>>>> good.
> >>>>>>     My main concern, for the moment is to have something working as
> >>>>>> fast as possible, i.e. some daemons that I'd be able to deploy to a
> >> working
> >>>>>> hbase cluster and send them work to do in some form (first step
> would
> >> be to
> >>>>>> treat is as a non distributed engine where each daemon runs an
> >> instance of
> >>>>>> the prototype).
> >>>>>>     Here's where I'd like to go next:
> >>>>>>     - lay the ground work for the daemons (scripts/rpc iface/wiring
> >>>>>> protocol).
> >>>>>>     - create an execution engine iface that allows to abstract
> future
> >>>>>> implementations, and make it available through the rpc iface. this
> >> would
> >>>>>> sit in front of the ref impl for now and would be replaced by cpp
> >> down the
> >>>>>> line.
> >>>>>>
> >>>>>>     I think we can probably concentrate on the capabilities iface a
> >>>>>> bit down the line but, as a first approach, I see it simply
> providing
> >> a
> >>>>>> simple set of ops that it is able to run internally.
> >>>>>>     How to abstract locality/partitioning/schema capabilities is
> till
> >>>>>> not clear to me though, thoughts?
> >>>>>>
> >>>>>> David
> >>>>>>
> >>>>>> On Mar 13, 2013, at 11:12 AM, Jacques Nadeau <ja...@apache.org>
> >> wrote:
> >>>>>>
> >>>>>>> I'm working on a presentation that will better illustrate the
> layers.
> >>>>>>> There are actually three key plans.  Thinking to date has been to
> >> break
> >>>>>>> the plans down into logical, physical and execution.  The third
> >> hasn't
> >>>>>> been
> >>>>>>> expressed well here and is entirely an internal domain to the
> >> execution
> >>>>>>> engine.  Following some classic methods: Logical expresses what we
> >> want
> >>>>>> to
> >>>>>>> do, Physical expresses how we want to do it (adding points of
> >>>>>>> parallelization but not specifying particular amounts of
> >> parallelization
> >>>>>> or
> >>>>>>> node by node assignments).  The execution engine is then
> responsible
> >> for
> >>>>>>> determining the amount of parallelization of a particular plan
> along
> >> with
> >>>>>>> system load (likely leveraging Berkeley's Sparrow work), task
> >> priority
> >>>>>> and
> >>>>>>> specific data locality information, building sub-dags to be
> assigned
> >> to
> >>>>>>> individual nodes and execute the plan.
> >>>>>>>
> >>>>>>> So in the higher logical and physical levels, a single Scan and
> >>>>>> subsequent
> >>>>>>> ScanPOP should be okay...  (ScanROPs have a separate problems since
> >> they
> >>>>>>> ignore the level of separation we're planning for the real
> execution
> >>>>>> layer.
> >>>>>>> This is the why the current ref impl turns a single Scan into
> >> potentially
> >>>>>>> a union of ScanROPs... not elegant but logically correct.)
> >>>>>>>
> >>>>>>> The capabilities interface still needs to be defined for how a
> >> storage
> >>>>>>> engine reveals its logical capabilities and thus consumes part of
> the
> >>>>>> plan.
> >>>>>>>
> >>>>>>> J
> >>>>>>>
> >>>>>>>
> >>>>>>> On Tue, Mar 12, 2013 at 10:19 PM, David Alves <
> davidralves@gmail.com
> >>>
> >>>>>> wrote:
> >>>>>>>
> >>>>>>>> Hi Linsen
> >>>>>>>>
> >>>>>>>>    Some of what you are saying like push down of ops like filter,
> >>>>>>>> projection or partial aggregation below the storage engine scanner
> >>>>>> level,
> >>>>>>>> or sub tree execution are actively being discussed in issues
> >> DRILL-13
> >>>>>>>> (Strorage Engine Interface) and DRILL-15 (Hbase storage engine),
> >> your
> >>>>>> input
> >>>>>>>> in these issues is most welcome.
> >>>>>>>>
> >>>>>>>>    HBase in particular has the notion of
> >>>>>>>> enpoints/coprocessors/filters that allow pushing this down easily
> >> (this
> >>>>>> is
> >>>>>>>> also in line with what other parallel database over nosql
> >>>>>> implementations
> >>>>>>>> like tajo do).
> >>>>>>>>    A possible approach is to have the optimizer change the order
> of
> >>>>>>>> the ops to place them below the storage engine scanner and let the
> >> SE
> >>>>>> impl
> >>>>>>>> deal with it internally.
> >>>>>>>>
> >>>>>>>>    There are also some other pieces missing at the moment AFAIK,
> >>>>>> like
> >>>>>>>> a distributed metadata store, the drill daemons, wiring, etc.
> >>>>>>>>
> >>>>>>>>    So in summary, you're absolutely right, and if you're
> >>>>>> particularly
> >>>>>>>> interested in the HBase SE impl (as I am, for the moment) I'd be
> >>>>>> interested
> >>>>>>>> in collaborating.
> >>>>>>>>
> >>>>>>>> Best
> >>>>>>>> David
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On Mar 12, 2013, at 11:44 PM, Lisen Mu <im...@gmail.com> wrote:
> >>>>>>>>
> >>>>>>>>> Hi David,
> >>>>>>>>>
> >>>>>>>>> Very nice to see your effort on this.
> >>>>>>>>>
> >>>>>>>>> Hi Jacques,
> >>>>>>>>>
> >>>>>>>>> we are also extending drill prototype, to see if there is any
> >> chance to
> >>>>>>>>> meet our production need. However, We find that implementing a
> >>>>>> performant
> >>>>>>>>> HBase storage engine is a not so straight-forward work, and
> >> requires
> >>>>>> some
> >>>>>>>>> workaround. The problem is in Scan interface.
> >>>>>>>>>
> >>>>>>>>> In drill's physical plan model, ScanROP is in charge of table
> scan.
> >>>>>>>> Storage
> >>>>>>>>> engine provides output for a whole data source, a csv file for
> >> example.
> >>>>>>>>> It's sufficient for input source like plain file, but for hbase,
> >> it's
> >>>>>> not
> >>>>>>>>> very efficient, if not impossible, to let ScanROP retrieve a
> whole
> >>>>>> htable
> >>>>>>>>> into drill. Storage engines like HBase should have some ablility
> >> to do
> >>>>>>>> part
> >>>>>>>>> of the DrQL query, like Filter, if a filter can be performed by
> >>>>>>>> specifying
> >>>>>>>>> startRowKey and endRowKey. Storage engine like mysql could do
> more,
> >>>>>> even
> >>>>>>>>> Join.
> >>>>>>>>>
> >>>>>>>>> Generally, it would be more clear if a ScanROP is mapped to a
> >> sub-DAG
> >>>>>> of
> >>>>>>>>> logical plan DAG instead of a single Scan node in logical plan.
> If
> >> so,
> >>>>>>>> more
> >>>>>>>>> implementation-specific information would coupe into the plan
> >>>>>>>> optimization
> >>>>>>>>> & transformation phase. I guess that's the price to pay when
> >>>>>> optimization
> >>>>>>>>> comes, or is there other way I failed to see?
> >>>>>>>>>
> >>>>>>>>> Please correct me if anything is wrong.
> >>>>>>>>>
> >>>>>>>>> thanks,
> >>>>>>>>>
> >>>>>>>>> Lisen
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> On Wed, Mar 13, 2013 at 9:33 AM, David Alves <
> >> davidralves@gmail.com>
> >>>>>>>> wrote:
> >>>>>>>>>
> >>>>>>>>>> Hi Jacques
> >>>>>>>>>>
> >>>>>>>>>>   I've submitted a fist pass patch to DRILL-15.
> >>>>>>>>>>   I did this mostly because HBase will be my main target and
> >>>>>>>> because
> >>>>>>>>>> I wanted to get a feel of what would be a nice interface for
> >> DRILL-13.
> >>>>>>>> Have
> >>>>>>>>>> some thoughts that I will post soon.
> >>>>>>>>>>   btw: I still can't assign issues to myself in JIRA, did you
> >>>>>>>> forget
> >>>>>>>>>> to add me as a contributor?
> >>>>>>>>>>
> >>>>>>>>>> Best
> >>>>>>>>>> David
> >>>>>>>>>>
> >>>>>>>>>> On Mar 11, 2013, at 2:13 PM, Jacques Nadeau <jacques@apache.org
> >
> >>>>>> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> Hey David,
> >>>>>>>>>>>
> >>>>>>>>>>> These sound good.  I've add you as a contributor on jira so you
> >> can
> >>>>>>>>>> assign
> >>>>>>>>>>> tasks to yourself.  I think 45 and 46 are good places to start.
> >> 15
> >>>>>>>>>> depends
> >>>>>>>>>>> on 13 and working on the two hand in hand would probably be a
> >> good
> >>>>>>>> idea.
> >>>>>>>>>>> Maybe we could do a design discussion on 15 and 13 here once
> you
> >> have
> >>>>>>>>>> some
> >>>>>>>>>>> time to focus on it.
> >>>>>>>>>>>
> >>>>>>>>>>> Jacques
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> On Mon, Mar 11, 2013 at 3:02 AM, David Alves <
> >> davidralves@gmail.com>
> >>>>>>>>>> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>> Hi All
> >>>>>>>>>>>>
> >>>>>>>>>>>>  I have a new academic project for which I'd like to use drill
> >>>>>>>>>>>> since none of the other parallel database over hadoop/nosql
> >>>>>>>>>> implementations
> >>>>>>>>>>>> fit just right.
> >>>>>>>>>>>>  To this goal I've been tinkering with the prototype trying to
> >>>>>>>>>> find
> >>>>>>>>>>>> where I'd be most useful.
> >>>>>>>>>>>>
> >>>>>>>>>>>>  Here's where I'd like to start, if you agree:
> >>>>>>>>>>>>  - implement HBase storage engine (DRILL-15)
> >>>>>>>>>>>>          - start with simple scanning an push down of
> >>>>>>>>>>>> selection/projection
> >>>>>>>>>>>>  - implement the LogicalPlanBuilder (DRILL-45)
> >>>>>>>>>>>>  - setup coding style in the wiki (formatting/imports etc,
> >>>>>>>>>> DRILL-46)
> >>>>>>>>>>>>  - create builders for all logical plan elements/make logical
> >>>>>>>>>> plans
> >>>>>>>>>>>> immutable (no issue for this, I'd like to hear your thoughts
> >> first).
> >>>>>>>>>>>>
> >>>>>>>>>>>>  Please let me know your thoughts, and if you agree please
> >>>>>> assign
> >>>>>>>>>>>> the issues to me (it seems that I can't assign them myself).
> >>>>>>>>>>>>
> >>>>>>>>>>>> Best
> >>>>>>>>>>>> David Alves
> >>>>
> >>
> >>
>
>

Re: contribution

Posted by David Alves <da...@gmail.com>.

Hey Jacques

	Sorry to be a nag, but is there any change to take a sneak peak at the protobuf rpc stuff?
	I'd really like hack something together wrt to the daemon this weekend.
	Also, wrt to configuration management (zk/helix) maybe you could post the iface so that it'd be possible to hack something static (i.e. non-ft, properties file based) just to make dist execution work.

Thanks
David

On Mar 16, 2013, at 8:34 PM, Jacques Nadeau <ja...@apache.org> wrote:

> Hey David,
> 
> The java-exec framework is not far enough along that it makes sense for me
> to push it externally yet.  However, I did push my initial wip physical
> plan approach.  You can find it here:
> https://github.com/jacques-n/incubator-drill/tree/physical_plan_updates
> 
> Hopefully, I will get further along on the java-exec stuff soon.
> 
> I'd suggest that you focus your energy on the StorageEngine API and HBase
> implementation.  If you're up for it, let's do a quick skype chat to sync
> up.  Let me know your availability over the next few days.
> 
> Thanks,
> Jacques
> 
> 
> 
> On Fri, Mar 15, 2013 at 6:59 PM, David Alves <da...@gmail.com> wrote:
> 
>> that'd be great thanks.
>> 
>> -david
>> 
>> On Mar 15, 2013, at 8:51 PM, Jacques Nadeau <ja...@gmail.com>
>> wrote:
>> 
>>> I've been under the weather the last few days and haven't made much
>>> progress. Let me see if I can get you something tomorrow.
>>> 
>>> On Mar 15, 2013, at 2:36 PM, David Alves <da...@gmail.com> wrote:
>>> 
>>>> Hi Jacques
>>>> 
>>>>  Is there any chance we could get a preview of this physical plan
>> stuff and basic plumbing for distributed execution before the weekend?
>> maybe in a github branch somewhere?
>>>>  I mean it doesn't have to be complete or even running, I'd just like
>> to make some progress with other stuff and keeping it in line with
>> whichever plumbing you already have would be great.
>>>> 
>>>> Best
>>>> David
>>>> 
>>>> On Mar 13, 2013, at 3:12 PM, Jacques Nadeau <ja...@apache.org> wrote:
>>>> 
>>>>> I'm working on some physical plan stuff as well as some basic plumbing
>> for
>>>>> distributed execution.  Its very in progress so I need to clean things
>> up a
>>>>> bit before we could collaborate/ divide and conquer on it.  Depending
>> on
>>>>> your timing and availability, maybe I could put some of this together
>> in
>>>>> the next couple days so that you could plug in rather than reinvent.
>> In
>>>>> the meantime, pushing forward the builder stuff, additional test cases
>> on
>>>>> the reference interpreter and/or thinking through the logical plan
>> storage
>>>>> engine pushdown/rewrite could be very useful.
>>>>> 
>>>>> Let me know your thoughts.
>>>>> 
>>>>> thanks,
>>>>> Jacques
>>>>> 
>>>>> On Wed, Mar 13, 2013 at 9:47 AM, David Alves <da...@gmail.com>
>> wrote:
>>>>> 
>>>>>> Hi Jacques
>>>>>> 
>>>>>>     I can assign issues to me now, thanks.
>>>>>>     What you say wrt to the logical/physical/execution layers sounds
>>>>>> good.
>>>>>>     My main concern, for the moment is to have something working as
>>>>>> fast as possible, i.e. some daemons that I'd be able to deploy to a
>> working
>>>>>> hbase cluster and send them work to do in some form (first step would
>> be to
>>>>>> treat is as a non distributed engine where each daemon runs an
>> instance of
>>>>>> the prototype).
>>>>>>     Here's where I'd like to go next:
>>>>>>     - lay the ground work for the daemons (scripts/rpc iface/wiring
>>>>>> protocol).
>>>>>>     - create an execution engine iface that allows to abstract future
>>>>>> implementations, and make it available through the rpc iface. this
>> would
>>>>>> sit in front of the ref impl for now and would be replaced by cpp
>> down the
>>>>>> line.
>>>>>> 
>>>>>>     I think we can probably concentrate on the capabilities iface a
>>>>>> bit down the line but, as a first approach, I see it simply providing
>> a
>>>>>> simple set of ops that it is able to run internally.
>>>>>>     How to abstract locality/partitioning/schema capabilities is till
>>>>>> not clear to me though, thoughts?
>>>>>> 
>>>>>> David
>>>>>> 
>>>>>> On Mar 13, 2013, at 11:12 AM, Jacques Nadeau <ja...@apache.org>
>> wrote:
>>>>>> 
>>>>>>> I'm working on a presentation that will better illustrate the layers.
>>>>>>> There are actually three key plans.  Thinking to date has been to
>> break
>>>>>>> the plans down into logical, physical and execution.  The third
>> hasn't
>>>>>> been
>>>>>>> expressed well here and is entirely an internal domain to the
>> execution
>>>>>>> engine.  Following some classic methods: Logical expresses what we
>> want
>>>>>> to
>>>>>>> do, Physical expresses how we want to do it (adding points of
>>>>>>> parallelization but not specifying particular amounts of
>> parallelization
>>>>>> or
>>>>>>> node by node assignments).  The execution engine is then responsible
>> for
>>>>>>> determining the amount of parallelization of a particular plan along
>> with
>>>>>>> system load (likely leveraging Berkeley's Sparrow work), task
>> priority
>>>>>> and
>>>>>>> specific data locality information, building sub-dags to be assigned
>> to
>>>>>>> individual nodes and execute the plan.
>>>>>>> 
>>>>>>> So in the higher logical and physical levels, a single Scan and
>>>>>> subsequent
>>>>>>> ScanPOP should be okay...  (ScanROPs have a separate problems since
>> they
>>>>>>> ignore the level of separation we're planning for the real execution
>>>>>> layer.
>>>>>>> This is the why the current ref impl turns a single Scan into
>> potentially
>>>>>>> a union of ScanROPs... not elegant but logically correct.)
>>>>>>> 
>>>>>>> The capabilities interface still needs to be defined for how a
>> storage
>>>>>>> engine reveals its logical capabilities and thus consumes part of the
>>>>>> plan.
>>>>>>> 
>>>>>>> J
>>>>>>> 
>>>>>>> 
>>>>>>> On Tue, Mar 12, 2013 at 10:19 PM, David Alves <davidralves@gmail.com
>>> 
>>>>>> wrote:
>>>>>>> 
>>>>>>>> Hi Linsen
>>>>>>>> 
>>>>>>>>    Some of what you are saying like push down of ops like filter,
>>>>>>>> projection or partial aggregation below the storage engine scanner
>>>>>> level,
>>>>>>>> or sub tree execution are actively being discussed in issues
>> DRILL-13
>>>>>>>> (Strorage Engine Interface) and DRILL-15 (Hbase storage engine),
>> your
>>>>>> input
>>>>>>>> in these issues is most welcome.
>>>>>>>> 
>>>>>>>>    HBase in particular has the notion of
>>>>>>>> enpoints/coprocessors/filters that allow pushing this down easily
>> (this
>>>>>> is
>>>>>>>> also in line with what other parallel database over nosql
>>>>>> implementations
>>>>>>>> like tajo do).
>>>>>>>>    A possible approach is to have the optimizer change the order of
>>>>>>>> the ops to place them below the storage engine scanner and let the
>> SE
>>>>>> impl
>>>>>>>> deal with it internally.
>>>>>>>> 
>>>>>>>>    There are also some other pieces missing at the moment AFAIK,
>>>>>> like
>>>>>>>> a distributed metadata store, the drill daemons, wiring, etc.
>>>>>>>> 
>>>>>>>>    So in summary, you're absolutely right, and if you're
>>>>>> particularly
>>>>>>>> interested in the HBase SE impl (as I am, for the moment) I'd be
>>>>>> interested
>>>>>>>> in collaborating.
>>>>>>>> 
>>>>>>>> Best
>>>>>>>> David
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Mar 12, 2013, at 11:44 PM, Lisen Mu <im...@gmail.com> wrote:
>>>>>>>> 
>>>>>>>>> Hi David,
>>>>>>>>> 
>>>>>>>>> Very nice to see your effort on this.
>>>>>>>>> 
>>>>>>>>> Hi Jacques,
>>>>>>>>> 
>>>>>>>>> we are also extending drill prototype, to see if there is any
>> chance to
>>>>>>>>> meet our production need. However, We find that implementing a
>>>>>> performant
>>>>>>>>> HBase storage engine is a not so straight-forward work, and
>> requires
>>>>>> some
>>>>>>>>> workaround. The problem is in Scan interface.
>>>>>>>>> 
>>>>>>>>> In drill's physical plan model, ScanROP is in charge of table scan.
>>>>>>>> Storage
>>>>>>>>> engine provides output for a whole data source, a csv file for
>> example.
>>>>>>>>> It's sufficient for input source like plain file, but for hbase,
>> it's
>>>>>> not
>>>>>>>>> very efficient, if not impossible, to let ScanROP retrieve a whole
>>>>>> htable
>>>>>>>>> into drill. Storage engines like HBase should have some ablility
>> to do
>>>>>>>> part
>>>>>>>>> of the DrQL query, like Filter, if a filter can be performed by
>>>>>>>> specifying
>>>>>>>>> startRowKey and endRowKey. Storage engine like mysql could do more,
>>>>>> even
>>>>>>>>> Join.
>>>>>>>>> 
>>>>>>>>> Generally, it would be more clear if a ScanROP is mapped to a
>> sub-DAG
>>>>>> of
>>>>>>>>> logical plan DAG instead of a single Scan node in logical plan. If
>> so,
>>>>>>>> more
>>>>>>>>> implementation-specific information would coupe into the plan
>>>>>>>> optimization
>>>>>>>>> & transformation phase. I guess that's the price to pay when
>>>>>> optimization
>>>>>>>>> comes, or is there other way I failed to see?
>>>>>>>>> 
>>>>>>>>> Please correct me if anything is wrong.
>>>>>>>>> 
>>>>>>>>> thanks,
>>>>>>>>> 
>>>>>>>>> Lisen
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On Wed, Mar 13, 2013 at 9:33 AM, David Alves <
>> davidralves@gmail.com>
>>>>>>>> wrote:
>>>>>>>>> 
>>>>>>>>>> Hi Jacques
>>>>>>>>>> 
>>>>>>>>>>   I've submitted a fist pass patch to DRILL-15.
>>>>>>>>>>   I did this mostly because HBase will be my main target and
>>>>>>>> because
>>>>>>>>>> I wanted to get a feel of what would be a nice interface for
>> DRILL-13.
>>>>>>>> Have
>>>>>>>>>> some thoughts that I will post soon.
>>>>>>>>>>   btw: I still can't assign issues to myself in JIRA, did you
>>>>>>>> forget
>>>>>>>>>> to add me as a contributor?
>>>>>>>>>> 
>>>>>>>>>> Best
>>>>>>>>>> David
>>>>>>>>>> 
>>>>>>>>>> On Mar 11, 2013, at 2:13 PM, Jacques Nadeau <ja...@apache.org>
>>>>>> wrote:
>>>>>>>>>> 
>>>>>>>>>>> Hey David,
>>>>>>>>>>> 
>>>>>>>>>>> These sound good.  I've add you as a contributor on jira so you
>> can
>>>>>>>>>> assign
>>>>>>>>>>> tasks to yourself.  I think 45 and 46 are good places to start.
>> 15
>>>>>>>>>> depends
>>>>>>>>>>> on 13 and working on the two hand in hand would probably be a
>> good
>>>>>>>> idea.
>>>>>>>>>>> Maybe we could do a design discussion on 15 and 13 here once you
>> have
>>>>>>>>>> some
>>>>>>>>>>> time to focus on it.
>>>>>>>>>>> 
>>>>>>>>>>> Jacques
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> On Mon, Mar 11, 2013 at 3:02 AM, David Alves <
>> davidralves@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>> 
>>>>>>>>>>>> Hi All
>>>>>>>>>>>> 
>>>>>>>>>>>>  I have a new academic project for which I'd like to use drill
>>>>>>>>>>>> since none of the other parallel database over hadoop/nosql
>>>>>>>>>> implementations
>>>>>>>>>>>> fit just right.
>>>>>>>>>>>>  To this goal I've been tinkering with the prototype trying to
>>>>>>>>>> find
>>>>>>>>>>>> where I'd be most useful.
>>>>>>>>>>>> 
>>>>>>>>>>>>  Here's where I'd like to start, if you agree:
>>>>>>>>>>>>  - implement HBase storage engine (DRILL-15)
>>>>>>>>>>>>          - start with simple scanning an push down of
>>>>>>>>>>>> selection/projection
>>>>>>>>>>>>  - implement the LogicalPlanBuilder (DRILL-45)
>>>>>>>>>>>>  - setup coding style in the wiki (formatting/imports etc,
>>>>>>>>>> DRILL-46)
>>>>>>>>>>>>  - create builders for all logical plan elements/make logical
>>>>>>>>>> plans
>>>>>>>>>>>> immutable (no issue for this, I'd like to hear your thoughts
>> first).
>>>>>>>>>>>> 
>>>>>>>>>>>>  Please let me know your thoughts, and if you agree please
>>>>>> assign
>>>>>>>>>>>> the issues to me (it seems that I can't assign them myself).
>>>>>>>>>>>> 
>>>>>>>>>>>> Best
>>>>>>>>>>>> David Alves
>>>> 
>> 
>>

Re: contribution

Posted by Jacques Nadeau <ja...@apache.org>.

Hey David,

The java-exec framework is not far enough along that it makes sense for me
to push it externally yet.  However, I did push my initial wip physical
plan approach.  You can find it here:
https://github.com/jacques-n/incubator-drill/tree/physical_plan_updates

Hopefully, I will get further along on the java-exec stuff soon.

I'd suggest that you focus your energy on the StorageEngine API and HBase
implementation.  If you're up for it, let's do a quick skype chat to sync
up.  Let me know your availability over the next few days.

Thanks,
Jacques



On Fri, Mar 15, 2013 at 6:59 PM, David Alves <da...@gmail.com> wrote:

> that'd be great thanks.
>
> -david
>
> On Mar 15, 2013, at 8:51 PM, Jacques Nadeau <ja...@gmail.com>
> wrote:
>
> > I've been under the weather the last few days and haven't made much
> > progress. Let me see if I can get you something tomorrow.
> >
> > On Mar 15, 2013, at 2:36 PM, David Alves <da...@gmail.com> wrote:
> >
> >> Hi Jacques
> >>
> >>   Is there any chance we could get a preview of this physical plan
> stuff and basic plumbing for distributed execution before the weekend?
> maybe in a github branch somewhere?
> >>   I mean it doesn't have to be complete or even running, I'd just like
> to make some progress with other stuff and keeping it in line with
> whichever plumbing you already have would be great.
> >>
> >> Best
> >> David
> >>
> >> On Mar 13, 2013, at 3:12 PM, Jacques Nadeau <ja...@apache.org> wrote:
> >>
> >>> I'm working on some physical plan stuff as well as some basic plumbing
> for
> >>> distributed execution.  Its very in progress so I need to clean things
> up a
> >>> bit before we could collaborate/ divide and conquer on it.  Depending
> on
> >>> your timing and availability, maybe I could put some of this together
> in
> >>> the next couple days so that you could plug in rather than reinvent.
>  In
> >>> the meantime, pushing forward the builder stuff, additional test cases
> on
> >>> the reference interpreter and/or thinking through the logical plan
> storage
> >>> engine pushdown/rewrite could be very useful.
> >>>
> >>> Let me know your thoughts.
> >>>
> >>> thanks,
> >>> Jacques
> >>>
> >>> On Wed, Mar 13, 2013 at 9:47 AM, David Alves <da...@gmail.com>
> wrote:
> >>>
> >>>> Hi Jacques
> >>>>
> >>>>      I can assign issues to me now, thanks.
> >>>>      What you say wrt to the logical/physical/execution layers sounds
> >>>> good.
> >>>>      My main concern, for the moment is to have something working as
> >>>> fast as possible, i.e. some daemons that I'd be able to deploy to a
> working
> >>>> hbase cluster and send them work to do in some form (first step would
> be to
> >>>> treat is as a non distributed engine where each daemon runs an
> instance of
> >>>> the prototype).
> >>>>      Here's where I'd like to go next:
> >>>>      - lay the ground work for the daemons (scripts/rpc iface/wiring
> >>>> protocol).
> >>>>      - create an execution engine iface that allows to abstract future
> >>>> implementations, and make it available through the rpc iface. this
> would
> >>>> sit in front of the ref impl for now and would be replaced by cpp
> down the
> >>>> line.
> >>>>
> >>>>      I think we can probably concentrate on the capabilities iface a
> >>>> bit down the line but, as a first approach, I see it simply providing
> a
> >>>> simple set of ops that it is able to run internally.
> >>>>      How to abstract locality/partitioning/schema capabilities is till
> >>>> not clear to me though, thoughts?
> >>>>
> >>>> David
> >>>>
> >>>> On Mar 13, 2013, at 11:12 AM, Jacques Nadeau <ja...@apache.org>
> wrote:
> >>>>
> >>>>> I'm working on a presentation that will better illustrate the layers.
> >>>>> There are actually three key plans.  Thinking to date has been to
> break
> >>>>> the plans down into logical, physical and execution.  The third
> hasn't
> >>>> been
> >>>>> expressed well here and is entirely an internal domain to the
> execution
> >>>>> engine.  Following some classic methods: Logical expresses what we
> want
> >>>> to
> >>>>> do, Physical expresses how we want to do it (adding points of
> >>>>> parallelization but not specifying particular amounts of
> parallelization
> >>>> or
> >>>>> node by node assignments).  The execution engine is then responsible
> for
> >>>>> determining the amount of parallelization of a particular plan along
> with
> >>>>> system load (likely leveraging Berkeley's Sparrow work), task
> priority
> >>>> and
> >>>>> specific data locality information, building sub-dags to be assigned
> to
> >>>>> individual nodes and execute the plan.
> >>>>>
> >>>>> So in the higher logical and physical levels, a single Scan and
> >>>> subsequent
> >>>>> ScanPOP should be okay...  (ScanROPs have a separate problems since
> they
> >>>>> ignore the level of separation we're planning for the real execution
> >>>> layer.
> >>>>> This is the why the current ref impl turns a single Scan into
> potentially
> >>>>> a union of ScanROPs... not elegant but logically correct.)
> >>>>>
> >>>>> The capabilities interface still needs to be defined for how a
> storage
> >>>>> engine reveals its logical capabilities and thus consumes part of the
> >>>> plan.
> >>>>>
> >>>>> J
> >>>>>
> >>>>>
> >>>>> On Tue, Mar 12, 2013 at 10:19 PM, David Alves <davidralves@gmail.com
> >
> >>>> wrote:
> >>>>>
> >>>>>> Hi Linsen
> >>>>>>
> >>>>>>     Some of what you are saying like push down of ops like filter,
> >>>>>> projection or partial aggregation below the storage engine scanner
> >>>> level,
> >>>>>> or sub tree execution are actively being discussed in issues
> DRILL-13
> >>>>>> (Strorage Engine Interface) and DRILL-15 (Hbase storage engine),
> your
> >>>> input
> >>>>>> in these issues is most welcome.
> >>>>>>
> >>>>>>     HBase in particular has the notion of
> >>>>>> enpoints/coprocessors/filters that allow pushing this down easily
> (this
> >>>> is
> >>>>>> also in line with what other parallel database over nosql
> >>>> implementations
> >>>>>> like tajo do).
> >>>>>>     A possible approach is to have the optimizer change the order of
> >>>>>> the ops to place them below the storage engine scanner and let the
> SE
> >>>> impl
> >>>>>> deal with it internally.
> >>>>>>
> >>>>>>     There are also some other pieces missing at the moment AFAIK,
> >>>> like
> >>>>>> a distributed metadata store, the drill daemons, wiring, etc.
> >>>>>>
> >>>>>>     So in summary, you're absolutely right, and if you're
> >>>> particularly
> >>>>>> interested in the HBase SE impl (as I am, for the moment) I'd be
> >>>> interested
> >>>>>> in collaborating.
> >>>>>>
> >>>>>> Best
> >>>>>> David
> >>>>>>
> >>>>>>
> >>>>>> On Mar 12, 2013, at 11:44 PM, Lisen Mu <im...@gmail.com> wrote:
> >>>>>>
> >>>>>>> Hi David,
> >>>>>>>
> >>>>>>> Very nice to see your effort on this.
> >>>>>>>
> >>>>>>> Hi Jacques,
> >>>>>>>
> >>>>>>> we are also extending drill prototype, to see if there is any
> chance to
> >>>>>>> meet our production need. However, We find that implementing a
> >>>> performant
> >>>>>>> HBase storage engine is a not so straight-forward work, and
> requires
> >>>> some
> >>>>>>> workaround. The problem is in Scan interface.
> >>>>>>>
> >>>>>>> In drill's physical plan model, ScanROP is in charge of table scan.
> >>>>>> Storage
> >>>>>>> engine provides output for a whole data source, a csv file for
> example.
> >>>>>>> It's sufficient for input source like plain file, but for hbase,
> it's
> >>>> not
> >>>>>>> very efficient, if not impossible, to let ScanROP retrieve a whole
> >>>> htable
> >>>>>>> into drill. Storage engines like HBase should have some ablility
> to do
> >>>>>> part
> >>>>>>> of the DrQL query, like Filter, if a filter can be performed by
> >>>>>> specifying
> >>>>>>> startRowKey and endRowKey. Storage engine like mysql could do more,
> >>>> even
> >>>>>>> Join.
> >>>>>>>
> >>>>>>> Generally, it would be more clear if a ScanROP is mapped to a
> sub-DAG
> >>>> of
> >>>>>>> logical plan DAG instead of a single Scan node in logical plan. If
> so,
> >>>>>> more
> >>>>>>> implementation-specific information would coupe into the plan
> >>>>>> optimization
> >>>>>>> & transformation phase. I guess that's the price to pay when
> >>>> optimization
> >>>>>>> comes, or is there other way I failed to see?
> >>>>>>>
> >>>>>>> Please correct me if anything is wrong.
> >>>>>>>
> >>>>>>> thanks,
> >>>>>>>
> >>>>>>> Lisen
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> On Wed, Mar 13, 2013 at 9:33 AM, David Alves <
> davidralves@gmail.com>
> >>>>>> wrote:
> >>>>>>>
> >>>>>>>> Hi Jacques
> >>>>>>>>
> >>>>>>>>    I've submitted a fist pass patch to DRILL-15.
> >>>>>>>>    I did this mostly because HBase will be my main target and
> >>>>>> because
> >>>>>>>> I wanted to get a feel of what would be a nice interface for
> DRILL-13.
> >>>>>> Have
> >>>>>>>> some thoughts that I will post soon.
> >>>>>>>>    btw: I still can't assign issues to myself in JIRA, did you
> >>>>>> forget
> >>>>>>>> to add me as a contributor?
> >>>>>>>>
> >>>>>>>> Best
> >>>>>>>> David
> >>>>>>>>
> >>>>>>>> On Mar 11, 2013, at 2:13 PM, Jacques Nadeau <ja...@apache.org>
> >>>> wrote:
> >>>>>>>>
> >>>>>>>>> Hey David,
> >>>>>>>>>
> >>>>>>>>> These sound good.  I've add you as a contributor on jira so you
> can
> >>>>>>>> assign
> >>>>>>>>> tasks to yourself.  I think 45 and 46 are good places to start.
>  15
> >>>>>>>> depends
> >>>>>>>>> on 13 and working on the two hand in hand would probably be a
> good
> >>>>>> idea.
> >>>>>>>>> Maybe we could do a design discussion on 15 and 13 here once you
> have
> >>>>>>>> some
> >>>>>>>>> time to focus on it.
> >>>>>>>>>
> >>>>>>>>> Jacques
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> On Mon, Mar 11, 2013 at 3:02 AM, David Alves <
> davidralves@gmail.com>
> >>>>>>>> wrote:
> >>>>>>>>>
> >>>>>>>>>> Hi All
> >>>>>>>>>>
> >>>>>>>>>>   I have a new academic project for which I'd like to use drill
> >>>>>>>>>> since none of the other parallel database over hadoop/nosql
> >>>>>>>> implementations
> >>>>>>>>>> fit just right.
> >>>>>>>>>>   To this goal I've been tinkering with the prototype trying to
> >>>>>>>> find
> >>>>>>>>>> where I'd be most useful.
> >>>>>>>>>>
> >>>>>>>>>>   Here's where I'd like to start, if you agree:
> >>>>>>>>>>   - implement HBase storage engine (DRILL-15)
> >>>>>>>>>>           - start with simple scanning an push down of
> >>>>>>>>>> selection/projection
> >>>>>>>>>>   - implement the LogicalPlanBuilder (DRILL-45)
> >>>>>>>>>>   - setup coding style in the wiki (formatting/imports etc,
> >>>>>>>> DRILL-46)
> >>>>>>>>>>   - create builders for all logical plan elements/make logical
> >>>>>>>> plans
> >>>>>>>>>> immutable (no issue for this, I'd like to hear your thoughts
> first).
> >>>>>>>>>>
> >>>>>>>>>>   Please let me know your thoughts, and if you agree please
> >>>> assign
> >>>>>>>>>> the issues to me (it seems that I can't assign them myself).
> >>>>>>>>>>
> >>>>>>>>>> Best
> >>>>>>>>>> David Alves
> >>
>
>

Re: contribution

Posted by David Alves <da...@gmail.com>.

that'd be great thanks.

-david

On Mar 15, 2013, at 8:51 PM, Jacques Nadeau <ja...@gmail.com> wrote:

> I've been under the weather the last few days and haven't made much
> progress. Let me see if I can get you something tomorrow.
> 
> On Mar 15, 2013, at 2:36 PM, David Alves <da...@gmail.com> wrote:
> 
>> Hi Jacques
>> 
>>   Is there any chance we could get a preview of this physical plan stuff and basic plumbing for distributed execution before the weekend? maybe in a github branch somewhere?
>>   I mean it doesn't have to be complete or even running, I'd just like to make some progress with other stuff and keeping it in line with whichever plumbing you already have would be great.
>> 
>> Best
>> David
>> 
>> On Mar 13, 2013, at 3:12 PM, Jacques Nadeau <ja...@apache.org> wrote:
>> 
>>> I'm working on some physical plan stuff as well as some basic plumbing for
>>> distributed execution.  Its very in progress so I need to clean things up a
>>> bit before we could collaborate/ divide and conquer on it.  Depending on
>>> your timing and availability, maybe I could put some of this together in
>>> the next couple days so that you could plug in rather than reinvent.  In
>>> the meantime, pushing forward the builder stuff, additional test cases on
>>> the reference interpreter and/or thinking through the logical plan storage
>>> engine pushdown/rewrite could be very useful.
>>> 
>>> Let me know your thoughts.
>>> 
>>> thanks,
>>> Jacques
>>> 
>>> On Wed, Mar 13, 2013 at 9:47 AM, David Alves <da...@gmail.com> wrote:
>>> 
>>>> Hi Jacques
>>>> 
>>>>      I can assign issues to me now, thanks.
>>>>      What you say wrt to the logical/physical/execution layers sounds
>>>> good.
>>>>      My main concern, for the moment is to have something working as
>>>> fast as possible, i.e. some daemons that I'd be able to deploy to a working
>>>> hbase cluster and send them work to do in some form (first step would be to
>>>> treat is as a non distributed engine where each daemon runs an instance of
>>>> the prototype).
>>>>      Here's where I'd like to go next:
>>>>      - lay the ground work for the daemons (scripts/rpc iface/wiring
>>>> protocol).
>>>>      - create an execution engine iface that allows to abstract future
>>>> implementations, and make it available through the rpc iface. this would
>>>> sit in front of the ref impl for now and would be replaced by cpp down the
>>>> line.
>>>> 
>>>>      I think we can probably concentrate on the capabilities iface a
>>>> bit down the line but, as a first approach, I see it simply providing a
>>>> simple set of ops that it is able to run internally.
>>>>      How to abstract locality/partitioning/schema capabilities is till
>>>> not clear to me though, thoughts?
>>>> 
>>>> David
>>>> 
>>>> On Mar 13, 2013, at 11:12 AM, Jacques Nadeau <ja...@apache.org> wrote:
>>>> 
>>>>> I'm working on a presentation that will better illustrate the layers.
>>>>> There are actually three key plans.  Thinking to date has been to break
>>>>> the plans down into logical, physical and execution.  The third hasn't
>>>> been
>>>>> expressed well here and is entirely an internal domain to the execution
>>>>> engine.  Following some classic methods: Logical expresses what we want
>>>> to
>>>>> do, Physical expresses how we want to do it (adding points of
>>>>> parallelization but not specifying particular amounts of parallelization
>>>> or
>>>>> node by node assignments).  The execution engine is then responsible for
>>>>> determining the amount of parallelization of a particular plan along with
>>>>> system load (likely leveraging Berkeley's Sparrow work), task priority
>>>> and
>>>>> specific data locality information, building sub-dags to be assigned to
>>>>> individual nodes and execute the plan.
>>>>> 
>>>>> So in the higher logical and physical levels, a single Scan and
>>>> subsequent
>>>>> ScanPOP should be okay...  (ScanROPs have a separate problems since they
>>>>> ignore the level of separation we're planning for the real execution
>>>> layer.
>>>>> This is the why the current ref impl turns a single Scan into potentially
>>>>> a union of ScanROPs... not elegant but logically correct.)
>>>>> 
>>>>> The capabilities interface still needs to be defined for how a storage
>>>>> engine reveals its logical capabilities and thus consumes part of the
>>>> plan.
>>>>> 
>>>>> J
>>>>> 
>>>>> 
>>>>> On Tue, Mar 12, 2013 at 10:19 PM, David Alves <da...@gmail.com>
>>>> wrote:
>>>>> 
>>>>>> Hi Linsen
>>>>>> 
>>>>>>     Some of what you are saying like push down of ops like filter,
>>>>>> projection or partial aggregation below the storage engine scanner
>>>> level,
>>>>>> or sub tree execution are actively being discussed in issues DRILL-13
>>>>>> (Strorage Engine Interface) and DRILL-15 (Hbase storage engine), your
>>>> input
>>>>>> in these issues is most welcome.
>>>>>> 
>>>>>>     HBase in particular has the notion of
>>>>>> enpoints/coprocessors/filters that allow pushing this down easily (this
>>>> is
>>>>>> also in line with what other parallel database over nosql
>>>> implementations
>>>>>> like tajo do).
>>>>>>     A possible approach is to have the optimizer change the order of
>>>>>> the ops to place them below the storage engine scanner and let the SE
>>>> impl
>>>>>> deal with it internally.
>>>>>> 
>>>>>>     There are also some other pieces missing at the moment AFAIK,
>>>> like
>>>>>> a distributed metadata store, the drill daemons, wiring, etc.
>>>>>> 
>>>>>>     So in summary, you're absolutely right, and if you're
>>>> particularly
>>>>>> interested in the HBase SE impl (as I am, for the moment) I'd be
>>>> interested
>>>>>> in collaborating.
>>>>>> 
>>>>>> Best
>>>>>> David
>>>>>> 
>>>>>> 
>>>>>> On Mar 12, 2013, at 11:44 PM, Lisen Mu <im...@gmail.com> wrote:
>>>>>> 
>>>>>>> Hi David,
>>>>>>> 
>>>>>>> Very nice to see your effort on this.
>>>>>>> 
>>>>>>> Hi Jacques,
>>>>>>> 
>>>>>>> we are also extending drill prototype, to see if there is any chance to
>>>>>>> meet our production need. However, We find that implementing a
>>>> performant
>>>>>>> HBase storage engine is a not so straight-forward work, and requires
>>>> some
>>>>>>> workaround. The problem is in Scan interface.
>>>>>>> 
>>>>>>> In drill's physical plan model, ScanROP is in charge of table scan.
>>>>>> Storage
>>>>>>> engine provides output for a whole data source, a csv file for example.
>>>>>>> It's sufficient for input source like plain file, but for hbase, it's
>>>> not
>>>>>>> very efficient, if not impossible, to let ScanROP retrieve a whole
>>>> htable
>>>>>>> into drill. Storage engines like HBase should have some ablility to do
>>>>>> part
>>>>>>> of the DrQL query, like Filter, if a filter can be performed by
>>>>>> specifying
>>>>>>> startRowKey and endRowKey. Storage engine like mysql could do more,
>>>> even
>>>>>>> Join.
>>>>>>> 
>>>>>>> Generally, it would be more clear if a ScanROP is mapped to a sub-DAG
>>>> of
>>>>>>> logical plan DAG instead of a single Scan node in logical plan. If so,
>>>>>> more
>>>>>>> implementation-specific information would coupe into the plan
>>>>>> optimization
>>>>>>> & transformation phase. I guess that's the price to pay when
>>>> optimization
>>>>>>> comes, or is there other way I failed to see?
>>>>>>> 
>>>>>>> Please correct me if anything is wrong.
>>>>>>> 
>>>>>>> thanks,
>>>>>>> 
>>>>>>> Lisen
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> On Wed, Mar 13, 2013 at 9:33 AM, David Alves <da...@gmail.com>
>>>>>> wrote:
>>>>>>> 
>>>>>>>> Hi Jacques
>>>>>>>> 
>>>>>>>>    I've submitted a fist pass patch to DRILL-15.
>>>>>>>>    I did this mostly because HBase will be my main target and
>>>>>> because
>>>>>>>> I wanted to get a feel of what would be a nice interface for DRILL-13.
>>>>>> Have
>>>>>>>> some thoughts that I will post soon.
>>>>>>>>    btw: I still can't assign issues to myself in JIRA, did you
>>>>>> forget
>>>>>>>> to add me as a contributor?
>>>>>>>> 
>>>>>>>> Best
>>>>>>>> David
>>>>>>>> 
>>>>>>>> On Mar 11, 2013, at 2:13 PM, Jacques Nadeau <ja...@apache.org>
>>>> wrote:
>>>>>>>> 
>>>>>>>>> Hey David,
>>>>>>>>> 
>>>>>>>>> These sound good.  I've add you as a contributor on jira so you can
>>>>>>>> assign
>>>>>>>>> tasks to yourself.  I think 45 and 46 are good places to start.  15
>>>>>>>> depends
>>>>>>>>> on 13 and working on the two hand in hand would probably be a good
>>>>>> idea.
>>>>>>>>> Maybe we could do a design discussion on 15 and 13 here once you have
>>>>>>>> some
>>>>>>>>> time to focus on it.
>>>>>>>>> 
>>>>>>>>> Jacques
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On Mon, Mar 11, 2013 at 3:02 AM, David Alves <da...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>> 
>>>>>>>>>> Hi All
>>>>>>>>>> 
>>>>>>>>>>   I have a new academic project for which I'd like to use drill
>>>>>>>>>> since none of the other parallel database over hadoop/nosql
>>>>>>>> implementations
>>>>>>>>>> fit just right.
>>>>>>>>>>   To this goal I've been tinkering with the prototype trying to
>>>>>>>> find
>>>>>>>>>> where I'd be most useful.
>>>>>>>>>> 
>>>>>>>>>>   Here's where I'd like to start, if you agree:
>>>>>>>>>>   - implement HBase storage engine (DRILL-15)
>>>>>>>>>>           - start with simple scanning an push down of
>>>>>>>>>> selection/projection
>>>>>>>>>>   - implement the LogicalPlanBuilder (DRILL-45)
>>>>>>>>>>   - setup coding style in the wiki (formatting/imports etc,
>>>>>>>> DRILL-46)
>>>>>>>>>>   - create builders for all logical plan elements/make logical
>>>>>>>> plans
>>>>>>>>>> immutable (no issue for this, I'd like to hear your thoughts first).
>>>>>>>>>> 
>>>>>>>>>>   Please let me know your thoughts, and if you agree please
>>>> assign
>>>>>>>>>> the issues to me (it seems that I can't assign them myself).
>>>>>>>>>> 
>>>>>>>>>> Best
>>>>>>>>>> David Alves
>>

Re: contribution

Posted by Jacques Nadeau <ja...@gmail.com>.

I've been under the weather the last few days and haven't made much
progress. Let me see if I can get you something tomorrow.

On Mar 15, 2013, at 2:36 PM, David Alves <da...@gmail.com> wrote:

> Hi Jacques
>
>    Is there any chance we could get a preview of this physical plan stuff and basic plumbing for distributed execution before the weekend? maybe in a github branch somewhere?
>    I mean it doesn't have to be complete or even running, I'd just like to make some progress with other stuff and keeping it in line with whichever plumbing you already have would be great.
>
> Best
> David
>
> On Mar 13, 2013, at 3:12 PM, Jacques Nadeau <ja...@apache.org> wrote:
>
>> I'm working on some physical plan stuff as well as some basic plumbing for
>> distributed execution.  Its very in progress so I need to clean things up a
>> bit before we could collaborate/ divide and conquer on it.  Depending on
>> your timing and availability, maybe I could put some of this together in
>> the next couple days so that you could plug in rather than reinvent.  In
>> the meantime, pushing forward the builder stuff, additional test cases on
>> the reference interpreter and/or thinking through the logical plan storage
>> engine pushdown/rewrite could be very useful.
>>
>> Let me know your thoughts.
>>
>> thanks,
>> Jacques
>>
>> On Wed, Mar 13, 2013 at 9:47 AM, David Alves <da...@gmail.com> wrote:
>>
>>> Hi Jacques
>>>
>>>       I can assign issues to me now, thanks.
>>>       What you say wrt to the logical/physical/execution layers sounds
>>> good.
>>>       My main concern, for the moment is to have something working as
>>> fast as possible, i.e. some daemons that I'd be able to deploy to a working
>>> hbase cluster and send them work to do in some form (first step would be to
>>> treat is as a non distributed engine where each daemon runs an instance of
>>> the prototype).
>>>       Here's where I'd like to go next:
>>>       - lay the ground work for the daemons (scripts/rpc iface/wiring
>>> protocol).
>>>       - create an execution engine iface that allows to abstract future
>>> implementations, and make it available through the rpc iface. this would
>>> sit in front of the ref impl for now and would be replaced by cpp down the
>>> line.
>>>
>>>       I think we can probably concentrate on the capabilities iface a
>>> bit down the line but, as a first approach, I see it simply providing a
>>> simple set of ops that it is able to run internally.
>>>       How to abstract locality/partitioning/schema capabilities is till
>>> not clear to me though, thoughts?
>>>
>>> David
>>>
>>> On Mar 13, 2013, at 11:12 AM, Jacques Nadeau <ja...@apache.org> wrote:
>>>
>>>> I'm working on a presentation that will better illustrate the layers.
>>>> There are actually three key plans.  Thinking to date has been to break
>>>> the plans down into logical, physical and execution.  The third hasn't
>>> been
>>>> expressed well here and is entirely an internal domain to the execution
>>>> engine.  Following some classic methods: Logical expresses what we want
>>> to
>>>> do, Physical expresses how we want to do it (adding points of
>>>> parallelization but not specifying particular amounts of parallelization
>>> or
>>>> node by node assignments).  The execution engine is then responsible for
>>>> determining the amount of parallelization of a particular plan along with
>>>> system load (likely leveraging Berkeley's Sparrow work), task priority
>>> and
>>>> specific data locality information, building sub-dags to be assigned to
>>>> individual nodes and execute the plan.
>>>>
>>>> So in the higher logical and physical levels, a single Scan and
>>> subsequent
>>>> ScanPOP should be okay...  (ScanROPs have a separate problems since they
>>>> ignore the level of separation we're planning for the real execution
>>> layer.
>>>> This is the why the current ref impl turns a single Scan into potentially
>>>> a union of ScanROPs... not elegant but logically correct.)
>>>>
>>>> The capabilities interface still needs to be defined for how a storage
>>>> engine reveals its logical capabilities and thus consumes part of the
>>> plan.
>>>>
>>>> J
>>>>
>>>>
>>>> On Tue, Mar 12, 2013 at 10:19 PM, David Alves <da...@gmail.com>
>>> wrote:
>>>>
>>>>> Hi Linsen
>>>>>
>>>>>      Some of what you are saying like push down of ops like filter,
>>>>> projection or partial aggregation below the storage engine scanner
>>> level,
>>>>> or sub tree execution are actively being discussed in issues DRILL-13
>>>>> (Strorage Engine Interface) and DRILL-15 (Hbase storage engine), your
>>> input
>>>>> in these issues is most welcome.
>>>>>
>>>>>      HBase in particular has the notion of
>>>>> enpoints/coprocessors/filters that allow pushing this down easily (this
>>> is
>>>>> also in line with what other parallel database over nosql
>>> implementations
>>>>> like tajo do).
>>>>>      A possible approach is to have the optimizer change the order of
>>>>> the ops to place them below the storage engine scanner and let the SE
>>> impl
>>>>> deal with it internally.
>>>>>
>>>>>      There are also some other pieces missing at the moment AFAIK,
>>> like
>>>>> a distributed metadata store, the drill daemons, wiring, etc.
>>>>>
>>>>>      So in summary, you're absolutely right, and if you're
>>> particularly
>>>>> interested in the HBase SE impl (as I am, for the moment) I'd be
>>> interested
>>>>> in collaborating.
>>>>>
>>>>> Best
>>>>> David
>>>>>
>>>>>
>>>>> On Mar 12, 2013, at 11:44 PM, Lisen Mu <im...@gmail.com> wrote:
>>>>>
>>>>>> Hi David,
>>>>>>
>>>>>> Very nice to see your effort on this.
>>>>>>
>>>>>> Hi Jacques,
>>>>>>
>>>>>> we are also extending drill prototype, to see if there is any chance to
>>>>>> meet our production need. However, We find that implementing a
>>> performant
>>>>>> HBase storage engine is a not so straight-forward work, and requires
>>> some
>>>>>> workaround. The problem is in Scan interface.
>>>>>>
>>>>>> In drill's physical plan model, ScanROP is in charge of table scan.
>>>>> Storage
>>>>>> engine provides output for a whole data source, a csv file for example.
>>>>>> It's sufficient for input source like plain file, but for hbase, it's
>>> not
>>>>>> very efficient, if not impossible, to let ScanROP retrieve a whole
>>> htable
>>>>>> into drill. Storage engines like HBase should have some ablility to do
>>>>> part
>>>>>> of the DrQL query, like Filter, if a filter can be performed by
>>>>> specifying
>>>>>> startRowKey and endRowKey. Storage engine like mysql could do more,
>>> even
>>>>>> Join.
>>>>>>
>>>>>> Generally, it would be more clear if a ScanROP is mapped to a sub-DAG
>>> of
>>>>>> logical plan DAG instead of a single Scan node in logical plan. If so,
>>>>> more
>>>>>> implementation-specific information would coupe into the plan
>>>>> optimization
>>>>>> & transformation phase. I guess that's the price to pay when
>>> optimization
>>>>>> comes, or is there other way I failed to see?
>>>>>>
>>>>>> Please correct me if anything is wrong.
>>>>>>
>>>>>> thanks,
>>>>>>
>>>>>> Lisen
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Wed, Mar 13, 2013 at 9:33 AM, David Alves <da...@gmail.com>
>>>>> wrote:
>>>>>>
>>>>>>> Hi Jacques
>>>>>>>
>>>>>>>     I've submitted a fist pass patch to DRILL-15.
>>>>>>>     I did this mostly because HBase will be my main target and
>>>>> because
>>>>>>> I wanted to get a feel of what would be a nice interface for DRILL-13.
>>>>> Have
>>>>>>> some thoughts that I will post soon.
>>>>>>>     btw: I still can't assign issues to myself in JIRA, did you
>>>>> forget
>>>>>>> to add me as a contributor?
>>>>>>>
>>>>>>> Best
>>>>>>> David
>>>>>>>
>>>>>>> On Mar 11, 2013, at 2:13 PM, Jacques Nadeau <ja...@apache.org>
>>> wrote:
>>>>>>>
>>>>>>>> Hey David,
>>>>>>>>
>>>>>>>> These sound good.  I've add you as a contributor on jira so you can
>>>>>>> assign
>>>>>>>> tasks to yourself.  I think 45 and 46 are good places to start.  15
>>>>>>> depends
>>>>>>>> on 13 and working on the two hand in hand would probably be a good
>>>>> idea.
>>>>>>>> Maybe we could do a design discussion on 15 and 13 here once you have
>>>>>>> some
>>>>>>>> time to focus on it.
>>>>>>>>
>>>>>>>> Jacques
>>>>>>>>
>>>>>>>>
>>>>>>>> On Mon, Mar 11, 2013 at 3:02 AM, David Alves <da...@gmail.com>
>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hi All
>>>>>>>>>
>>>>>>>>>    I have a new academic project for which I'd like to use drill
>>>>>>>>> since none of the other parallel database over hadoop/nosql
>>>>>>> implementations
>>>>>>>>> fit just right.
>>>>>>>>>    To this goal I've been tinkering with the prototype trying to
>>>>>>> find
>>>>>>>>> where I'd be most useful.
>>>>>>>>>
>>>>>>>>>    Here's where I'd like to start, if you agree:
>>>>>>>>>    - implement HBase storage engine (DRILL-15)
>>>>>>>>>            - start with simple scanning an push down of
>>>>>>>>> selection/projection
>>>>>>>>>    - implement the LogicalPlanBuilder (DRILL-45)
>>>>>>>>>    - setup coding style in the wiki (formatting/imports etc,
>>>>>>> DRILL-46)
>>>>>>>>>    - create builders for all logical plan elements/make logical
>>>>>>> plans
>>>>>>>>> immutable (no issue for this, I'd like to hear your thoughts first).
>>>>>>>>>
>>>>>>>>>    Please let me know your thoughts, and if you agree please
>>> assign
>>>>>>>>> the issues to me (it seems that I can't assign them myself).
>>>>>>>>>
>>>>>>>>> Best
>>>>>>>>> David Alves
>

Re: contribution

Posted by David Alves <da...@gmail.com>.

Hi Jacques

	Is there any chance we could get a preview of this physical plan stuff and basic plumbing for distributed execution before the weekend? maybe in a github branch somewhere?
	I mean it doesn't have to be complete or even running, I'd just like to make some progress with other stuff and keeping it in line with whichever plumbing you already have would be great.
	
Best
David

On Mar 13, 2013, at 3:12 PM, Jacques Nadeau <ja...@apache.org> wrote:

> I'm working on some physical plan stuff as well as some basic plumbing for
> distributed execution.  Its very in progress so I need to clean things up a
> bit before we could collaborate/ divide and conquer on it.  Depending on
> your timing and availability, maybe I could put some of this together in
> the next couple days so that you could plug in rather than reinvent.  In
> the meantime, pushing forward the builder stuff, additional test cases on
> the reference interpreter and/or thinking through the logical plan storage
> engine pushdown/rewrite could be very useful.
> 
> Let me know your thoughts.
> 
> thanks,
> Jacques
> 
> On Wed, Mar 13, 2013 at 9:47 AM, David Alves <da...@gmail.com> wrote:
> 
>> Hi Jacques
>> 
>>        I can assign issues to me now, thanks.
>>        What you say wrt to the logical/physical/execution layers sounds
>> good.
>>        My main concern, for the moment is to have something working as
>> fast as possible, i.e. some daemons that I'd be able to deploy to a working
>> hbase cluster and send them work to do in some form (first step would be to
>> treat is as a non distributed engine where each daemon runs an instance of
>> the prototype).
>>        Here's where I'd like to go next:
>>        - lay the ground work for the daemons (scripts/rpc iface/wiring
>> protocol).
>>        - create an execution engine iface that allows to abstract future
>> implementations, and make it available through the rpc iface. this would
>> sit in front of the ref impl for now and would be replaced by cpp down the
>> line.
>> 
>>        I think we can probably concentrate on the capabilities iface a
>> bit down the line but, as a first approach, I see it simply providing a
>> simple set of ops that it is able to run internally.
>>        How to abstract locality/partitioning/schema capabilities is till
>> not clear to me though, thoughts?
>> 
>> David
>> 
>> On Mar 13, 2013, at 11:12 AM, Jacques Nadeau <ja...@apache.org> wrote:
>> 
>>> I'm working on a presentation that will better illustrate the layers.
>>> There are actually three key plans.  Thinking to date has been to break
>>> the plans down into logical, physical and execution.  The third hasn't
>> been
>>> expressed well here and is entirely an internal domain to the execution
>>> engine.  Following some classic methods: Logical expresses what we want
>> to
>>> do, Physical expresses how we want to do it (adding points of
>>> parallelization but not specifying particular amounts of parallelization
>> or
>>> node by node assignments).  The execution engine is then responsible for
>>> determining the amount of parallelization of a particular plan along with
>>> system load (likely leveraging Berkeley's Sparrow work), task priority
>> and
>>> specific data locality information, building sub-dags to be assigned to
>>> individual nodes and execute the plan.
>>> 
>>> So in the higher logical and physical levels, a single Scan and
>> subsequent
>>> ScanPOP should be okay...  (ScanROPs have a separate problems since they
>>> ignore the level of separation we're planning for the real execution
>> layer.
>>> This is the why the current ref impl turns a single Scan into potentially
>>> a union of ScanROPs... not elegant but logically correct.)
>>> 
>>> The capabilities interface still needs to be defined for how a storage
>>> engine reveals its logical capabilities and thus consumes part of the
>> plan.
>>> 
>>> J
>>> 
>>> 
>>> On Tue, Mar 12, 2013 at 10:19 PM, David Alves <da...@gmail.com>
>> wrote:
>>> 
>>>> Hi Linsen
>>>> 
>>>>       Some of what you are saying like push down of ops like filter,
>>>> projection or partial aggregation below the storage engine scanner
>> level,
>>>> or sub tree execution are actively being discussed in issues DRILL-13
>>>> (Strorage Engine Interface) and DRILL-15 (Hbase storage engine), your
>> input
>>>> in these issues is most welcome.
>>>> 
>>>>       HBase in particular has the notion of
>>>> enpoints/coprocessors/filters that allow pushing this down easily (this
>> is
>>>> also in line with what other parallel database over nosql
>> implementations
>>>> like tajo do).
>>>>       A possible approach is to have the optimizer change the order of
>>>> the ops to place them below the storage engine scanner and let the SE
>> impl
>>>> deal with it internally.
>>>> 
>>>>       There are also some other pieces missing at the moment AFAIK,
>> like
>>>> a distributed metadata store, the drill daemons, wiring, etc.
>>>> 
>>>>       So in summary, you're absolutely right, and if you're
>> particularly
>>>> interested in the HBase SE impl (as I am, for the moment) I'd be
>> interested
>>>> in collaborating.
>>>> 
>>>> Best
>>>> David
>>>> 
>>>> 
>>>> On Mar 12, 2013, at 11:44 PM, Lisen Mu <im...@gmail.com> wrote:
>>>> 
>>>>> Hi David,
>>>>> 
>>>>> Very nice to see your effort on this.
>>>>> 
>>>>> Hi Jacques,
>>>>> 
>>>>> we are also extending drill prototype, to see if there is any chance to
>>>>> meet our production need. However, We find that implementing a
>> performant
>>>>> HBase storage engine is a not so straight-forward work, and requires
>> some
>>>>> workaround. The problem is in Scan interface.
>>>>> 
>>>>> In drill's physical plan model, ScanROP is in charge of table scan.
>>>> Storage
>>>>> engine provides output for a whole data source, a csv file for example.
>>>>> It's sufficient for input source like plain file, but for hbase, it's
>> not
>>>>> very efficient, if not impossible, to let ScanROP retrieve a whole
>> htable
>>>>> into drill. Storage engines like HBase should have some ablility to do
>>>> part
>>>>> of the DrQL query, like Filter, if a filter can be performed by
>>>> specifying
>>>>> startRowKey and endRowKey. Storage engine like mysql could do more,
>> even
>>>>> Join.
>>>>> 
>>>>> Generally, it would be more clear if a ScanROP is mapped to a sub-DAG
>> of
>>>>> logical plan DAG instead of a single Scan node in logical plan. If so,
>>>> more
>>>>> implementation-specific information would coupe into the plan
>>>> optimization
>>>>> & transformation phase. I guess that's the price to pay when
>> optimization
>>>>> comes, or is there other way I failed to see?
>>>>> 
>>>>> Please correct me if anything is wrong.
>>>>> 
>>>>> thanks,
>>>>> 
>>>>> Lisen
>>>>> 
>>>>> 
>>>>> 
>>>>> On Wed, Mar 13, 2013 at 9:33 AM, David Alves <da...@gmail.com>
>>>> wrote:
>>>>> 
>>>>>> Hi Jacques
>>>>>> 
>>>>>>      I've submitted a fist pass patch to DRILL-15.
>>>>>>      I did this mostly because HBase will be my main target and
>>>> because
>>>>>> I wanted to get a feel of what would be a nice interface for DRILL-13.
>>>> Have
>>>>>> some thoughts that I will post soon.
>>>>>>      btw: I still can't assign issues to myself in JIRA, did you
>>>> forget
>>>>>> to add me as a contributor?
>>>>>> 
>>>>>> Best
>>>>>> David
>>>>>> 
>>>>>> On Mar 11, 2013, at 2:13 PM, Jacques Nadeau <ja...@apache.org>
>> wrote:
>>>>>> 
>>>>>>> Hey David,
>>>>>>> 
>>>>>>> These sound good.  I've add you as a contributor on jira so you can
>>>>>> assign
>>>>>>> tasks to yourself.  I think 45 and 46 are good places to start.  15
>>>>>> depends
>>>>>>> on 13 and working on the two hand in hand would probably be a good
>>>> idea.
>>>>>>> Maybe we could do a design discussion on 15 and 13 here once you have
>>>>>> some
>>>>>>> time to focus on it.
>>>>>>> 
>>>>>>> Jacques
>>>>>>> 
>>>>>>> 
>>>>>>> On Mon, Mar 11, 2013 at 3:02 AM, David Alves <da...@gmail.com>
>>>>>> wrote:
>>>>>>> 
>>>>>>>> Hi All
>>>>>>>> 
>>>>>>>>     I have a new academic project for which I'd like to use drill
>>>>>>>> since none of the other parallel database over hadoop/nosql
>>>>>> implementations
>>>>>>>> fit just right.
>>>>>>>>     To this goal I've been tinkering with the prototype trying to
>>>>>> find
>>>>>>>> where I'd be most useful.
>>>>>>>> 
>>>>>>>>     Here's where I'd like to start, if you agree:
>>>>>>>>     - implement HBase storage engine (DRILL-15)
>>>>>>>>             - start with simple scanning an push down of
>>>>>>>> selection/projection
>>>>>>>>     - implement the LogicalPlanBuilder (DRILL-45)
>>>>>>>>     - setup coding style in the wiki (formatting/imports etc,
>>>>>> DRILL-46)
>>>>>>>>     - create builders for all logical plan elements/make logical
>>>>>> plans
>>>>>>>> immutable (no issue for this, I'd like to hear your thoughts first).
>>>>>>>> 
>>>>>>>>     Please let me know your thoughts, and if you agree please
>> assign
>>>>>>>> the issues to me (it seems that I can't assign them myself).
>>>>>>>> 
>>>>>>>> Best
>>>>>>>> David Alves
>>>>>> 
>>>>>> 
>>>> 
>>>> 
>> 
>>

Re: contribution

Posted by Jacques Nadeau <ja...@apache.org>.

Don't worry Tim, it is still very much on my radar.  Just well ahead of the
ref interpreter stuff.  Let me see what I can slice up in the next few days.

J

On Wed, Mar 13, 2013 at 2:40 PM, Timothy Chen <tn...@gmail.com> wrote:

> Looking forward to the plumbing as well, since my json scan op sat there
> for a while now :)
>
> Tim
>
>
> On Wed, Mar 13, 2013 at 2:30 PM, David Alves <da...@gmail.com>
> wrote:
>
> > Getting the basic plumbing to a point where we could work together on
> > it/use it elsewhere as soon as you can would be awesome.
> > As soon as I get that I can start on the daemons/scripts.
> > I'll  focus on the SE iface and on HBase pushdown for the moment.
> >
> > -david
> >
> > On Mar 13, 2013, at 3:12 PM, Jacques Nadeau <ja...@apache.org> wrote:
> >
> > > I'm working on some physical plan stuff as well as some basic plumbing
> > for
> > > distributed execution.  Its very in progress so I need to clean things
> > up a
> > > bit before we could collaborate/ divide and conquer on it.  Depending
> on
> > > your timing and availability, maybe I could put some of this together
> in
> > > the next couple days so that you could plug in rather than reinvent.
>  In
> > > the meantime, pushing forward the builder stuff, additional test cases
> on
> > > the reference interpreter and/or thinking through the logical plan
> > storage
> > > engine pushdown/rewrite could be very useful.
> > >
> > > Let me know your thoughts.
> > >
> > > thanks,
> > > Jacques
> > >
> > > On Wed, Mar 13, 2013 at 9:47 AM, David Alves <da...@gmail.com>
> > wrote:
> > >
> > >> Hi Jacques
> > >>
> > >>        I can assign issues to me now, thanks.
> > >>        What you say wrt to the logical/physical/execution layers
> sounds
> > >> good.
> > >>        My main concern, for the moment is to have something working as
> > >> fast as possible, i.e. some daemons that I'd be able to deploy to a
> > working
> > >> hbase cluster and send them work to do in some form (first step would
> > be to
> > >> treat is as a non distributed engine where each daemon runs an
> instance
> > of
> > >> the prototype).
> > >>        Here's where I'd like to go next:
> > >>        - lay the ground work for the daemons (scripts/rpc iface/wiring
> > >> protocol).
> > >>        - create an execution engine iface that allows to abstract
> future
> > >> implementations, and make it available through the rpc iface. this
> would
> > >> sit in front of the ref impl for now and would be replaced by cpp down
> > the
> > >> line.
> > >>
> > >>        I think we can probably concentrate on the capabilities iface a
> > >> bit down the line but, as a first approach, I see it simply providing
> a
> > >> simple set of ops that it is able to run internally.
> > >>        How to abstract locality/partitioning/schema capabilities is
> till
> > >> not clear to me though, thoughts?
> > >>
> > >> David
> > >>
> > >> On Mar 13, 2013, at 11:12 AM, Jacques Nadeau <ja...@apache.org>
> > wrote:
> > >>
> > >>> I'm working on a presentation that will better illustrate the layers.
> > >>> There are actually three key plans.  Thinking to date has been to
> break
> > >>> the plans down into logical, physical and execution.  The third
> hasn't
> > >> been
> > >>> expressed well here and is entirely an internal domain to the
> execution
> > >>> engine.  Following some classic methods: Logical expresses what we
> want
> > >> to
> > >>> do, Physical expresses how we want to do it (adding points of
> > >>> parallelization but not specifying particular amounts of
> > parallelization
> > >> or
> > >>> node by node assignments).  The execution engine is then responsible
> > for
> > >>> determining the amount of parallelization of a particular plan along
> > with
> > >>> system load (likely leveraging Berkeley's Sparrow work), task
> priority
> > >> and
> > >>> specific data locality information, building sub-dags to be assigned
> to
> > >>> individual nodes and execute the plan.
> > >>>
> > >>> So in the higher logical and physical levels, a single Scan and
> > >> subsequent
> > >>> ScanPOP should be okay...  (ScanROPs have a separate problems since
> > they
> > >>> ignore the level of separation we're planning for the real execution
> > >> layer.
> > >>> This is the why the current ref impl turns a single Scan into
> > potentially
> > >>> a union of ScanROPs... not elegant but logically correct.)
> > >>>
> > >>> The capabilities interface still needs to be defined for how a
> storage
> > >>> engine reveals its logical capabilities and thus consumes part of the
> > >> plan.
> > >>>
> > >>> J
> > >>>
> > >>>
> > >>> On Tue, Mar 12, 2013 at 10:19 PM, David Alves <davidralves@gmail.com
> >
> > >> wrote:
> > >>>
> > >>>> Hi Linsen
> > >>>>
> > >>>>       Some of what you are saying like push down of ops like filter,
> > >>>> projection or partial aggregation below the storage engine scanner
> > >> level,
> > >>>> or sub tree execution are actively being discussed in issues
> DRILL-13
> > >>>> (Strorage Engine Interface) and DRILL-15 (Hbase storage engine),
> your
> > >> input
> > >>>> in these issues is most welcome.
> > >>>>
> > >>>>       HBase in particular has the notion of
> > >>>> enpoints/coprocessors/filters that allow pushing this down easily
> > (this
> > >> is
> > >>>> also in line with what other parallel database over nosql
> > >> implementations
> > >>>> like tajo do).
> > >>>>       A possible approach is to have the optimizer change the order
> of
> > >>>> the ops to place them below the storage engine scanner and let the
> SE
> > >> impl
> > >>>> deal with it internally.
> > >>>>
> > >>>>       There are also some other pieces missing at the moment AFAIK,
> > >> like
> > >>>> a distributed metadata store, the drill daemons, wiring, etc.
> > >>>>
> > >>>>       So in summary, you're absolutely right, and if you're
> > >> particularly
> > >>>> interested in the HBase SE impl (as I am, for the moment) I'd be
> > >> interested
> > >>>> in collaborating.
> > >>>>
> > >>>> Best
> > >>>> David
> > >>>>
> > >>>>
> > >>>> On Mar 12, 2013, at 11:44 PM, Lisen Mu <im...@gmail.com> wrote:
> > >>>>
> > >>>>> Hi David,
> > >>>>>
> > >>>>> Very nice to see your effort on this.
> > >>>>>
> > >>>>> Hi Jacques,
> > >>>>>
> > >>>>> we are also extending drill prototype, to see if there is any
> chance
> > to
> > >>>>> meet our production need. However, We find that implementing a
> > >> performant
> > >>>>> HBase storage engine is a not so straight-forward work, and
> requires
> > >> some
> > >>>>> workaround. The problem is in Scan interface.
> > >>>>>
> > >>>>> In drill's physical plan model, ScanROP is in charge of table scan.
> > >>>> Storage
> > >>>>> engine provides output for a whole data source, a csv file for
> > example.
> > >>>>> It's sufficient for input source like plain file, but for hbase,
> it's
> > >> not
> > >>>>> very efficient, if not impossible, to let ScanROP retrieve a whole
> > >> htable
> > >>>>> into drill. Storage engines like HBase should have some ablility to
> > do
> > >>>> part
> > >>>>> of the DrQL query, like Filter, if a filter can be performed by
> > >>>> specifying
> > >>>>> startRowKey and endRowKey. Storage engine like mysql could do more,
> > >> even
> > >>>>> Join.
> > >>>>>
> > >>>>> Generally, it would be more clear if a ScanROP is mapped to a
> sub-DAG
> > >> of
> > >>>>> logical plan DAG instead of a single Scan node in logical plan. If
> > so,
> > >>>> more
> > >>>>> implementation-specific information would coupe into the plan
> > >>>> optimization
> > >>>>> & transformation phase. I guess that's the price to pay when
> > >> optimization
> > >>>>> comes, or is there other way I failed to see?
> > >>>>>
> > >>>>> Please correct me if anything is wrong.
> > >>>>>
> > >>>>> thanks,
> > >>>>>
> > >>>>> Lisen
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>> On Wed, Mar 13, 2013 at 9:33 AM, David Alves <
> davidralves@gmail.com>
> > >>>> wrote:
> > >>>>>
> > >>>>>> Hi Jacques
> > >>>>>>
> > >>>>>>      I've submitted a fist pass patch to DRILL-15.
> > >>>>>>      I did this mostly because HBase will be my main target and
> > >>>> because
> > >>>>>> I wanted to get a feel of what would be a nice interface for
> > DRILL-13.
> > >>>> Have
> > >>>>>> some thoughts that I will post soon.
> > >>>>>>      btw: I still can't assign issues to myself in JIRA, did you
> > >>>> forget
> > >>>>>> to add me as a contributor?
> > >>>>>>
> > >>>>>> Best
> > >>>>>> David
> > >>>>>>
> > >>>>>> On Mar 11, 2013, at 2:13 PM, Jacques Nadeau <ja...@apache.org>
> > >> wrote:
> > >>>>>>
> > >>>>>>> Hey David,
> > >>>>>>>
> > >>>>>>> These sound good.  I've add you as a contributor on jira so you
> can
> > >>>>>> assign
> > >>>>>>> tasks to yourself.  I think 45 and 46 are good places to start.
>  15
> > >>>>>> depends
> > >>>>>>> on 13 and working on the two hand in hand would probably be a
> good
> > >>>> idea.
> > >>>>>>> Maybe we could do a design discussion on 15 and 13 here once you
> > have
> > >>>>>> some
> > >>>>>>> time to focus on it.
> > >>>>>>>
> > >>>>>>> Jacques
> > >>>>>>>
> > >>>>>>>
> > >>>>>>> On Mon, Mar 11, 2013 at 3:02 AM, David Alves <
> > davidralves@gmail.com>
> > >>>>>> wrote:
> > >>>>>>>
> > >>>>>>>> Hi All
> > >>>>>>>>
> > >>>>>>>>     I have a new academic project for which I'd like to use
> drill
> > >>>>>>>> since none of the other parallel database over hadoop/nosql
> > >>>>>> implementations
> > >>>>>>>> fit just right.
> > >>>>>>>>     To this goal I've been tinkering with the prototype trying
> to
> > >>>>>> find
> > >>>>>>>> where I'd be most useful.
> > >>>>>>>>
> > >>>>>>>>     Here's where I'd like to start, if you agree:
> > >>>>>>>>     - implement HBase storage engine (DRILL-15)
> > >>>>>>>>             - start with simple scanning an push down of
> > >>>>>>>> selection/projection
> > >>>>>>>>     - implement the LogicalPlanBuilder (DRILL-45)
> > >>>>>>>>     - setup coding style in the wiki (formatting/imports etc,
> > >>>>>> DRILL-46)
> > >>>>>>>>     - create builders for all logical plan elements/make logical
> > >>>>>> plans
> > >>>>>>>> immutable (no issue for this, I'd like to hear your thoughts
> > first).
> > >>>>>>>>
> > >>>>>>>>     Please let me know your thoughts, and if you agree please
> > >> assign
> > >>>>>>>> the issues to me (it seems that I can't assign them myself).
> > >>>>>>>>
> > >>>>>>>> Best
> > >>>>>>>> David Alves
> > >>>>>>
> > >>>>>>
> > >>>>
> > >>>>
> > >>
> > >>
> >
> >
>

Re: contribution

Posted by Timothy Chen <tn...@gmail.com>.

Looking forward to the plumbing as well, since my json scan op sat there
for a while now :)

Tim


On Wed, Mar 13, 2013 at 2:30 PM, David Alves <da...@gmail.com> wrote:

> Getting the basic plumbing to a point where we could work together on
> it/use it elsewhere as soon as you can would be awesome.
> As soon as I get that I can start on the daemons/scripts.
> I'll  focus on the SE iface and on HBase pushdown for the moment.
>
> -david
>
> On Mar 13, 2013, at 3:12 PM, Jacques Nadeau <ja...@apache.org> wrote:
>
> > I'm working on some physical plan stuff as well as some basic plumbing
> for
> > distributed execution.  Its very in progress so I need to clean things
> up a
> > bit before we could collaborate/ divide and conquer on it.  Depending on
> > your timing and availability, maybe I could put some of this together in
> > the next couple days so that you could plug in rather than reinvent.  In
> > the meantime, pushing forward the builder stuff, additional test cases on
> > the reference interpreter and/or thinking through the logical plan
> storage
> > engine pushdown/rewrite could be very useful.
> >
> > Let me know your thoughts.
> >
> > thanks,
> > Jacques
> >
> > On Wed, Mar 13, 2013 at 9:47 AM, David Alves <da...@gmail.com>
> wrote:
> >
> >> Hi Jacques
> >>
> >>        I can assign issues to me now, thanks.
> >>        What you say wrt to the logical/physical/execution layers sounds
> >> good.
> >>        My main concern, for the moment is to have something working as
> >> fast as possible, i.e. some daemons that I'd be able to deploy to a
> working
> >> hbase cluster and send them work to do in some form (first step would
> be to
> >> treat is as a non distributed engine where each daemon runs an instance
> of
> >> the prototype).
> >>        Here's where I'd like to go next:
> >>        - lay the ground work for the daemons (scripts/rpc iface/wiring
> >> protocol).
> >>        - create an execution engine iface that allows to abstract future
> >> implementations, and make it available through the rpc iface. this would
> >> sit in front of the ref impl for now and would be replaced by cpp down
> the
> >> line.
> >>
> >>        I think we can probably concentrate on the capabilities iface a
> >> bit down the line but, as a first approach, I see it simply providing a
> >> simple set of ops that it is able to run internally.
> >>        How to abstract locality/partitioning/schema capabilities is till
> >> not clear to me though, thoughts?
> >>
> >> David
> >>
> >> On Mar 13, 2013, at 11:12 AM, Jacques Nadeau <ja...@apache.org>
> wrote:
> >>
> >>> I'm working on a presentation that will better illustrate the layers.
> >>> There are actually three key plans.  Thinking to date has been to break
> >>> the plans down into logical, physical and execution.  The third hasn't
> >> been
> >>> expressed well here and is entirely an internal domain to the execution
> >>> engine.  Following some classic methods: Logical expresses what we want
> >> to
> >>> do, Physical expresses how we want to do it (adding points of
> >>> parallelization but not specifying particular amounts of
> parallelization
> >> or
> >>> node by node assignments).  The execution engine is then responsible
> for
> >>> determining the amount of parallelization of a particular plan along
> with
> >>> system load (likely leveraging Berkeley's Sparrow work), task priority
> >> and
> >>> specific data locality information, building sub-dags to be assigned to
> >>> individual nodes and execute the plan.
> >>>
> >>> So in the higher logical and physical levels, a single Scan and
> >> subsequent
> >>> ScanPOP should be okay...  (ScanROPs have a separate problems since
> they
> >>> ignore the level of separation we're planning for the real execution
> >> layer.
> >>> This is the why the current ref impl turns a single Scan into
> potentially
> >>> a union of ScanROPs... not elegant but logically correct.)
> >>>
> >>> The capabilities interface still needs to be defined for how a storage
> >>> engine reveals its logical capabilities and thus consumes part of the
> >> plan.
> >>>
> >>> J
> >>>
> >>>
> >>> On Tue, Mar 12, 2013 at 10:19 PM, David Alves <da...@gmail.com>
> >> wrote:
> >>>
> >>>> Hi Linsen
> >>>>
> >>>>       Some of what you are saying like push down of ops like filter,
> >>>> projection or partial aggregation below the storage engine scanner
> >> level,
> >>>> or sub tree execution are actively being discussed in issues DRILL-13
> >>>> (Strorage Engine Interface) and DRILL-15 (Hbase storage engine), your
> >> input
> >>>> in these issues is most welcome.
> >>>>
> >>>>       HBase in particular has the notion of
> >>>> enpoints/coprocessors/filters that allow pushing this down easily
> (this
> >> is
> >>>> also in line with what other parallel database over nosql
> >> implementations
> >>>> like tajo do).
> >>>>       A possible approach is to have the optimizer change the order of
> >>>> the ops to place them below the storage engine scanner and let the SE
> >> impl
> >>>> deal with it internally.
> >>>>
> >>>>       There are also some other pieces missing at the moment AFAIK,
> >> like
> >>>> a distributed metadata store, the drill daemons, wiring, etc.
> >>>>
> >>>>       So in summary, you're absolutely right, and if you're
> >> particularly
> >>>> interested in the HBase SE impl (as I am, for the moment) I'd be
> >> interested
> >>>> in collaborating.
> >>>>
> >>>> Best
> >>>> David
> >>>>
> >>>>
> >>>> On Mar 12, 2013, at 11:44 PM, Lisen Mu <im...@gmail.com> wrote:
> >>>>
> >>>>> Hi David,
> >>>>>
> >>>>> Very nice to see your effort on this.
> >>>>>
> >>>>> Hi Jacques,
> >>>>>
> >>>>> we are also extending drill prototype, to see if there is any chance
> to
> >>>>> meet our production need. However, We find that implementing a
> >> performant
> >>>>> HBase storage engine is a not so straight-forward work, and requires
> >> some
> >>>>> workaround. The problem is in Scan interface.
> >>>>>
> >>>>> In drill's physical plan model, ScanROP is in charge of table scan.
> >>>> Storage
> >>>>> engine provides output for a whole data source, a csv file for
> example.
> >>>>> It's sufficient for input source like plain file, but for hbase, it's
> >> not
> >>>>> very efficient, if not impossible, to let ScanROP retrieve a whole
> >> htable
> >>>>> into drill. Storage engines like HBase should have some ablility to
> do
> >>>> part
> >>>>> of the DrQL query, like Filter, if a filter can be performed by
> >>>> specifying
> >>>>> startRowKey and endRowKey. Storage engine like mysql could do more,
> >> even
> >>>>> Join.
> >>>>>
> >>>>> Generally, it would be more clear if a ScanROP is mapped to a sub-DAG
> >> of
> >>>>> logical plan DAG instead of a single Scan node in logical plan. If
> so,
> >>>> more
> >>>>> implementation-specific information would coupe into the plan
> >>>> optimization
> >>>>> & transformation phase. I guess that's the price to pay when
> >> optimization
> >>>>> comes, or is there other way I failed to see?
> >>>>>
> >>>>> Please correct me if anything is wrong.
> >>>>>
> >>>>> thanks,
> >>>>>
> >>>>> Lisen
> >>>>>
> >>>>>
> >>>>>
> >>>>> On Wed, Mar 13, 2013 at 9:33 AM, David Alves <da...@gmail.com>
> >>>> wrote:
> >>>>>
> >>>>>> Hi Jacques
> >>>>>>
> >>>>>>      I've submitted a fist pass patch to DRILL-15.
> >>>>>>      I did this mostly because HBase will be my main target and
> >>>> because
> >>>>>> I wanted to get a feel of what would be a nice interface for
> DRILL-13.
> >>>> Have
> >>>>>> some thoughts that I will post soon.
> >>>>>>      btw: I still can't assign issues to myself in JIRA, did you
> >>>> forget
> >>>>>> to add me as a contributor?
> >>>>>>
> >>>>>> Best
> >>>>>> David
> >>>>>>
> >>>>>> On Mar 11, 2013, at 2:13 PM, Jacques Nadeau <ja...@apache.org>
> >> wrote:
> >>>>>>
> >>>>>>> Hey David,
> >>>>>>>
> >>>>>>> These sound good.  I've add you as a contributor on jira so you can
> >>>>>> assign
> >>>>>>> tasks to yourself.  I think 45 and 46 are good places to start.  15
> >>>>>> depends
> >>>>>>> on 13 and working on the two hand in hand would probably be a good
> >>>> idea.
> >>>>>>> Maybe we could do a design discussion on 15 and 13 here once you
> have
> >>>>>> some
> >>>>>>> time to focus on it.
> >>>>>>>
> >>>>>>> Jacques
> >>>>>>>
> >>>>>>>
> >>>>>>> On Mon, Mar 11, 2013 at 3:02 AM, David Alves <
> davidralves@gmail.com>
> >>>>>> wrote:
> >>>>>>>
> >>>>>>>> Hi All
> >>>>>>>>
> >>>>>>>>     I have a new academic project for which I'd like to use drill
> >>>>>>>> since none of the other parallel database over hadoop/nosql
> >>>>>> implementations
> >>>>>>>> fit just right.
> >>>>>>>>     To this goal I've been tinkering with the prototype trying to
> >>>>>> find
> >>>>>>>> where I'd be most useful.
> >>>>>>>>
> >>>>>>>>     Here's where I'd like to start, if you agree:
> >>>>>>>>     - implement HBase storage engine (DRILL-15)
> >>>>>>>>             - start with simple scanning an push down of
> >>>>>>>> selection/projection
> >>>>>>>>     - implement the LogicalPlanBuilder (DRILL-45)
> >>>>>>>>     - setup coding style in the wiki (formatting/imports etc,
> >>>>>> DRILL-46)
> >>>>>>>>     - create builders for all logical plan elements/make logical
> >>>>>> plans
> >>>>>>>> immutable (no issue for this, I'd like to hear your thoughts
> first).
> >>>>>>>>
> >>>>>>>>     Please let me know your thoughts, and if you agree please
> >> assign
> >>>>>>>> the issues to me (it seems that I can't assign them myself).
> >>>>>>>>
> >>>>>>>> Best
> >>>>>>>> David Alves
> >>>>>>
> >>>>>>
> >>>>
> >>>>
> >>
> >>
>
>

Re: contribution

Posted by David Alves <da...@gmail.com>.

Getting the basic plumbing to a point where we could work together on it/use it elsewhere as soon as you can would be awesome.
As soon as I get that I can start on the daemons/scripts.
I'll  focus on the SE iface and on HBase pushdown for the moment.

-david

On Mar 13, 2013, at 3:12 PM, Jacques Nadeau <ja...@apache.org> wrote:

> I'm working on some physical plan stuff as well as some basic plumbing for
> distributed execution.  Its very in progress so I need to clean things up a
> bit before we could collaborate/ divide and conquer on it.  Depending on
> your timing and availability, maybe I could put some of this together in
> the next couple days so that you could plug in rather than reinvent.  In
> the meantime, pushing forward the builder stuff, additional test cases on
> the reference interpreter and/or thinking through the logical plan storage
> engine pushdown/rewrite could be very useful.
> 
> Let me know your thoughts.
> 
> thanks,
> Jacques
> 
> On Wed, Mar 13, 2013 at 9:47 AM, David Alves <da...@gmail.com> wrote:
> 
>> Hi Jacques
>> 
>>        I can assign issues to me now, thanks.
>>        What you say wrt to the logical/physical/execution layers sounds
>> good.
>>        My main concern, for the moment is to have something working as
>> fast as possible, i.e. some daemons that I'd be able to deploy to a working
>> hbase cluster and send them work to do in some form (first step would be to
>> treat is as a non distributed engine where each daemon runs an instance of
>> the prototype).
>>        Here's where I'd like to go next:
>>        - lay the ground work for the daemons (scripts/rpc iface/wiring
>> protocol).
>>        - create an execution engine iface that allows to abstract future
>> implementations, and make it available through the rpc iface. this would
>> sit in front of the ref impl for now and would be replaced by cpp down the
>> line.
>> 
>>        I think we can probably concentrate on the capabilities iface a
>> bit down the line but, as a first approach, I see it simply providing a
>> simple set of ops that it is able to run internally.
>>        How to abstract locality/partitioning/schema capabilities is till
>> not clear to me though, thoughts?
>> 
>> David
>> 
>> On Mar 13, 2013, at 11:12 AM, Jacques Nadeau <ja...@apache.org> wrote:
>> 
>>> I'm working on a presentation that will better illustrate the layers.
>>> There are actually three key plans.  Thinking to date has been to break
>>> the plans down into logical, physical and execution.  The third hasn't
>> been
>>> expressed well here and is entirely an internal domain to the execution
>>> engine.  Following some classic methods: Logical expresses what we want
>> to
>>> do, Physical expresses how we want to do it (adding points of
>>> parallelization but not specifying particular amounts of parallelization
>> or
>>> node by node assignments).  The execution engine is then responsible for
>>> determining the amount of parallelization of a particular plan along with
>>> system load (likely leveraging Berkeley's Sparrow work), task priority
>> and
>>> specific data locality information, building sub-dags to be assigned to
>>> individual nodes and execute the plan.
>>> 
>>> So in the higher logical and physical levels, a single Scan and
>> subsequent
>>> ScanPOP should be okay...  (ScanROPs have a separate problems since they
>>> ignore the level of separation we're planning for the real execution
>> layer.
>>> This is the why the current ref impl turns a single Scan into potentially
>>> a union of ScanROPs... not elegant but logically correct.)
>>> 
>>> The capabilities interface still needs to be defined for how a storage
>>> engine reveals its logical capabilities and thus consumes part of the
>> plan.
>>> 
>>> J
>>> 
>>> 
>>> On Tue, Mar 12, 2013 at 10:19 PM, David Alves <da...@gmail.com>
>> wrote:
>>> 
>>>> Hi Linsen
>>>> 
>>>>       Some of what you are saying like push down of ops like filter,
>>>> projection or partial aggregation below the storage engine scanner
>> level,
>>>> or sub tree execution are actively being discussed in issues DRILL-13
>>>> (Strorage Engine Interface) and DRILL-15 (Hbase storage engine), your
>> input
>>>> in these issues is most welcome.
>>>> 
>>>>       HBase in particular has the notion of
>>>> enpoints/coprocessors/filters that allow pushing this down easily (this
>> is
>>>> also in line with what other parallel database over nosql
>> implementations
>>>> like tajo do).
>>>>       A possible approach is to have the optimizer change the order of
>>>> the ops to place them below the storage engine scanner and let the SE
>> impl
>>>> deal with it internally.
>>>> 
>>>>       There are also some other pieces missing at the moment AFAIK,
>> like
>>>> a distributed metadata store, the drill daemons, wiring, etc.
>>>> 
>>>>       So in summary, you're absolutely right, and if you're
>> particularly
>>>> interested in the HBase SE impl (as I am, for the moment) I'd be
>> interested
>>>> in collaborating.
>>>> 
>>>> Best
>>>> David
>>>> 
>>>> 
>>>> On Mar 12, 2013, at 11:44 PM, Lisen Mu <im...@gmail.com> wrote:
>>>> 
>>>>> Hi David,
>>>>> 
>>>>> Very nice to see your effort on this.
>>>>> 
>>>>> Hi Jacques,
>>>>> 
>>>>> we are also extending drill prototype, to see if there is any chance to
>>>>> meet our production need. However, We find that implementing a
>> performant
>>>>> HBase storage engine is a not so straight-forward work, and requires
>> some
>>>>> workaround. The problem is in Scan interface.
>>>>> 
>>>>> In drill's physical plan model, ScanROP is in charge of table scan.
>>>> Storage
>>>>> engine provides output for a whole data source, a csv file for example.
>>>>> It's sufficient for input source like plain file, but for hbase, it's
>> not
>>>>> very efficient, if not impossible, to let ScanROP retrieve a whole
>> htable
>>>>> into drill. Storage engines like HBase should have some ablility to do
>>>> part
>>>>> of the DrQL query, like Filter, if a filter can be performed by
>>>> specifying
>>>>> startRowKey and endRowKey. Storage engine like mysql could do more,
>> even
>>>>> Join.
>>>>> 
>>>>> Generally, it would be more clear if a ScanROP is mapped to a sub-DAG
>> of
>>>>> logical plan DAG instead of a single Scan node in logical plan. If so,
>>>> more
>>>>> implementation-specific information would coupe into the plan
>>>> optimization
>>>>> & transformation phase. I guess that's the price to pay when
>> optimization
>>>>> comes, or is there other way I failed to see?
>>>>> 
>>>>> Please correct me if anything is wrong.
>>>>> 
>>>>> thanks,
>>>>> 
>>>>> Lisen
>>>>> 
>>>>> 
>>>>> 
>>>>> On Wed, Mar 13, 2013 at 9:33 AM, David Alves <da...@gmail.com>
>>>> wrote:
>>>>> 
>>>>>> Hi Jacques
>>>>>> 
>>>>>>      I've submitted a fist pass patch to DRILL-15.
>>>>>>      I did this mostly because HBase will be my main target and
>>>> because
>>>>>> I wanted to get a feel of what would be a nice interface for DRILL-13.
>>>> Have
>>>>>> some thoughts that I will post soon.
>>>>>>      btw: I still can't assign issues to myself in JIRA, did you
>>>> forget
>>>>>> to add me as a contributor?
>>>>>> 
>>>>>> Best
>>>>>> David
>>>>>> 
>>>>>> On Mar 11, 2013, at 2:13 PM, Jacques Nadeau <ja...@apache.org>
>> wrote:
>>>>>> 
>>>>>>> Hey David,
>>>>>>> 
>>>>>>> These sound good.  I've add you as a contributor on jira so you can
>>>>>> assign
>>>>>>> tasks to yourself.  I think 45 and 46 are good places to start.  15
>>>>>> depends
>>>>>>> on 13 and working on the two hand in hand would probably be a good
>>>> idea.
>>>>>>> Maybe we could do a design discussion on 15 and 13 here once you have
>>>>>> some
>>>>>>> time to focus on it.
>>>>>>> 
>>>>>>> Jacques
>>>>>>> 
>>>>>>> 
>>>>>>> On Mon, Mar 11, 2013 at 3:02 AM, David Alves <da...@gmail.com>
>>>>>> wrote:
>>>>>>> 
>>>>>>>> Hi All
>>>>>>>> 
>>>>>>>>     I have a new academic project for which I'd like to use drill
>>>>>>>> since none of the other parallel database over hadoop/nosql
>>>>>> implementations
>>>>>>>> fit just right.
>>>>>>>>     To this goal I've been tinkering with the prototype trying to
>>>>>> find
>>>>>>>> where I'd be most useful.
>>>>>>>> 
>>>>>>>>     Here's where I'd like to start, if you agree:
>>>>>>>>     - implement HBase storage engine (DRILL-15)
>>>>>>>>             - start with simple scanning an push down of
>>>>>>>> selection/projection
>>>>>>>>     - implement the LogicalPlanBuilder (DRILL-45)
>>>>>>>>     - setup coding style in the wiki (formatting/imports etc,
>>>>>> DRILL-46)
>>>>>>>>     - create builders for all logical plan elements/make logical
>>>>>> plans
>>>>>>>> immutable (no issue for this, I'd like to hear your thoughts first).
>>>>>>>> 
>>>>>>>>     Please let me know your thoughts, and if you agree please
>> assign
>>>>>>>> the issues to me (it seems that I can't assign them myself).
>>>>>>>> 
>>>>>>>> Best
>>>>>>>> David Alves
>>>>>> 
>>>>>> 
>>>> 
>>>> 
>> 
>>

Re: contribution

Posted by Jacques Nadeau <ja...@apache.org>.

I'm working on some physical plan stuff as well as some basic plumbing for
distributed execution.  Its very in progress so I need to clean things up a
bit before we could collaborate/ divide and conquer on it.  Depending on
your timing and availability, maybe I could put some of this together in
the next couple days so that you could plug in rather than reinvent.  In
the meantime, pushing forward the builder stuff, additional test cases on
the reference interpreter and/or thinking through the logical plan storage
engine pushdown/rewrite could be very useful.

Let me know your thoughts.

thanks,
Jacques

On Wed, Mar 13, 2013 at 9:47 AM, David Alves <da...@gmail.com> wrote:

> Hi Jacques
>
>         I can assign issues to me now, thanks.
>         What you say wrt to the logical/physical/execution layers sounds
> good.
>         My main concern, for the moment is to have something working as
> fast as possible, i.e. some daemons that I'd be able to deploy to a working
> hbase cluster and send them work to do in some form (first step would be to
> treat is as a non distributed engine where each daemon runs an instance of
> the prototype).
>         Here's where I'd like to go next:
>         - lay the ground work for the daemons (scripts/rpc iface/wiring
> protocol).
>         - create an execution engine iface that allows to abstract future
> implementations, and make it available through the rpc iface. this would
> sit in front of the ref impl for now and would be replaced by cpp down the
> line.
>
>         I think we can probably concentrate on the capabilities iface a
> bit down the line but, as a first approach, I see it simply providing a
> simple set of ops that it is able to run internally.
>         How to abstract locality/partitioning/schema capabilities is till
> not clear to me though, thoughts?
>
> David
>
> On Mar 13, 2013, at 11:12 AM, Jacques Nadeau <ja...@apache.org> wrote:
>
> > I'm working on a presentation that will better illustrate the layers.
> > There are actually three key plans.  Thinking to date has been to break
> > the plans down into logical, physical and execution.  The third hasn't
> been
> > expressed well here and is entirely an internal domain to the execution
> > engine.  Following some classic methods: Logical expresses what we want
> to
> > do, Physical expresses how we want to do it (adding points of
> > parallelization but not specifying particular amounts of parallelization
> or
> > node by node assignments).  The execution engine is then responsible for
> > determining the amount of parallelization of a particular plan along with
> > system load (likely leveraging Berkeley's Sparrow work), task priority
> and
> > specific data locality information, building sub-dags to be assigned to
> > individual nodes and execute the plan.
> >
> > So in the higher logical and physical levels, a single Scan and
> subsequent
> > ScanPOP should be okay...  (ScanROPs have a separate problems since they
> > ignore the level of separation we're planning for the real execution
> layer.
> > This is the why the current ref impl turns a single Scan into potentially
> > a union of ScanROPs... not elegant but logically correct.)
> >
> > The capabilities interface still needs to be defined for how a storage
> > engine reveals its logical capabilities and thus consumes part of the
> plan.
> >
> > J
> >
> >
> > On Tue, Mar 12, 2013 at 10:19 PM, David Alves <da...@gmail.com>
> wrote:
> >
> >> Hi Linsen
> >>
> >>        Some of what you are saying like push down of ops like filter,
> >> projection or partial aggregation below the storage engine scanner
> level,
> >> or sub tree execution are actively being discussed in issues DRILL-13
> >> (Strorage Engine Interface) and DRILL-15 (Hbase storage engine), your
> input
> >> in these issues is most welcome.
> >>
> >>        HBase in particular has the notion of
> >> enpoints/coprocessors/filters that allow pushing this down easily (this
> is
> >> also in line with what other parallel database over nosql
> implementations
> >> like tajo do).
> >>        A possible approach is to have the optimizer change the order of
> >> the ops to place them below the storage engine scanner and let the SE
> impl
> >> deal with it internally.
> >>
> >>        There are also some other pieces missing at the moment AFAIK,
> like
> >> a distributed metadata store, the drill daemons, wiring, etc.
> >>
> >>        So in summary, you're absolutely right, and if you're
> particularly
> >> interested in the HBase SE impl (as I am, for the moment) I'd be
> interested
> >> in collaborating.
> >>
> >> Best
> >> David
> >>
> >>
> >> On Mar 12, 2013, at 11:44 PM, Lisen Mu <im...@gmail.com> wrote:
> >>
> >>> Hi David,
> >>>
> >>> Very nice to see your effort on this.
> >>>
> >>> Hi Jacques,
> >>>
> >>> we are also extending drill prototype, to see if there is any chance to
> >>> meet our production need. However, We find that implementing a
> performant
> >>> HBase storage engine is a not so straight-forward work, and requires
> some
> >>> workaround. The problem is in Scan interface.
> >>>
> >>> In drill's physical plan model, ScanROP is in charge of table scan.
> >> Storage
> >>> engine provides output for a whole data source, a csv file for example.
> >>> It's sufficient for input source like plain file, but for hbase, it's
> not
> >>> very efficient, if not impossible, to let ScanROP retrieve a whole
> htable
> >>> into drill. Storage engines like HBase should have some ablility to do
> >> part
> >>> of the DrQL query, like Filter, if a filter can be performed by
> >> specifying
> >>> startRowKey and endRowKey. Storage engine like mysql could do more,
> even
> >>> Join.
> >>>
> >>> Generally, it would be more clear if a ScanROP is mapped to a sub-DAG
> of
> >>> logical plan DAG instead of a single Scan node in logical plan. If so,
> >> more
> >>> implementation-specific information would coupe into the plan
> >> optimization
> >>> & transformation phase. I guess that's the price to pay when
> optimization
> >>> comes, or is there other way I failed to see?
> >>>
> >>> Please correct me if anything is wrong.
> >>>
> >>> thanks,
> >>>
> >>> Lisen
> >>>
> >>>
> >>>
> >>> On Wed, Mar 13, 2013 at 9:33 AM, David Alves <da...@gmail.com>
> >> wrote:
> >>>
> >>>> Hi Jacques
> >>>>
> >>>>       I've submitted a fist pass patch to DRILL-15.
> >>>>       I did this mostly because HBase will be my main target and
> >> because
> >>>> I wanted to get a feel of what would be a nice interface for DRILL-13.
> >> Have
> >>>> some thoughts that I will post soon.
> >>>>       btw: I still can't assign issues to myself in JIRA, did you
> >> forget
> >>>> to add me as a contributor?
> >>>>
> >>>> Best
> >>>> David
> >>>>
> >>>> On Mar 11, 2013, at 2:13 PM, Jacques Nadeau <ja...@apache.org>
> wrote:
> >>>>
> >>>>> Hey David,
> >>>>>
> >>>>> These sound good.  I've add you as a contributor on jira so you can
> >>>> assign
> >>>>> tasks to yourself.  I think 45 and 46 are good places to start.  15
> >>>> depends
> >>>>> on 13 and working on the two hand in hand would probably be a good
> >> idea.
> >>>>> Maybe we could do a design discussion on 15 and 13 here once you have
> >>>> some
> >>>>> time to focus on it.
> >>>>>
> >>>>> Jacques
> >>>>>
> >>>>>
> >>>>> On Mon, Mar 11, 2013 at 3:02 AM, David Alves <da...@gmail.com>
> >>>> wrote:
> >>>>>
> >>>>>> Hi All
> >>>>>>
> >>>>>>      I have a new academic project for which I'd like to use drill
> >>>>>> since none of the other parallel database over hadoop/nosql
> >>>> implementations
> >>>>>> fit just right.
> >>>>>>      To this goal I've been tinkering with the prototype trying to
> >>>> find
> >>>>>> where I'd be most useful.
> >>>>>>
> >>>>>>      Here's where I'd like to start, if you agree:
> >>>>>>      - implement HBase storage engine (DRILL-15)
> >>>>>>              - start with simple scanning an push down of
> >>>>>> selection/projection
> >>>>>>      - implement the LogicalPlanBuilder (DRILL-45)
> >>>>>>      - setup coding style in the wiki (formatting/imports etc,
> >>>> DRILL-46)
> >>>>>>      - create builders for all logical plan elements/make logical
> >>>> plans
> >>>>>> immutable (no issue for this, I'd like to hear your thoughts first).
> >>>>>>
> >>>>>>      Please let me know your thoughts, and if you agree please
> assign
> >>>>>> the issues to me (it seems that I can't assign them myself).
> >>>>>>
> >>>>>> Best
> >>>>>> David Alves
> >>>>
> >>>>
> >>
> >>
>
>

Re: contribution

Posted by David Alves <da...@gmail.com>.

Hi Jacques

	I can assign issues to me now, thanks.
	What you say wrt to the logical/physical/execution layers sounds good.
	My main concern, for the moment is to have something working as fast as possible, i.e. some daemons that I'd be able to deploy to a working hbase cluster and send them work to do in some form (first step would be to treat is as a non distributed engine where each daemon runs an instance of the prototype).
	Here's where I'd like to go next:
	- lay the ground work for the daemons (scripts/rpc iface/wiring protocol).
	- create an execution engine iface that allows to abstract future implementations, and make it available through the rpc iface. this would sit in front of the ref impl for now and would be replaced by cpp down the line.
	
	I think we can probably concentrate on the capabilities iface a bit down the line but, as a first approach, I see it simply providing a simple set of ops that it is able to run internally. 
	How to abstract locality/partitioning/schema capabilities is till not clear to me though, thoughts?

David

On Mar 13, 2013, at 11:12 AM, Jacques Nadeau <ja...@apache.org> wrote:

> I'm working on a presentation that will better illustrate the layers.
> There are actually three key plans.  Thinking to date has been to break
> the plans down into logical, physical and execution.  The third hasn't been
> expressed well here and is entirely an internal domain to the execution
> engine.  Following some classic methods: Logical expresses what we want to
> do, Physical expresses how we want to do it (adding points of
> parallelization but not specifying particular amounts of parallelization or
> node by node assignments).  The execution engine is then responsible for
> determining the amount of parallelization of a particular plan along with
> system load (likely leveraging Berkeley's Sparrow work), task priority and
> specific data locality information, building sub-dags to be assigned to
> individual nodes and execute the plan.
> 
> So in the higher logical and physical levels, a single Scan and subsequent
> ScanPOP should be okay...  (ScanROPs have a separate problems since they
> ignore the level of separation we're planning for the real execution layer.
> This is the why the current ref impl turns a single Scan into potentially
> a union of ScanROPs... not elegant but logically correct.)
> 
> The capabilities interface still needs to be defined for how a storage
> engine reveals its logical capabilities and thus consumes part of the plan.
> 
> J
> 
> 
> On Tue, Mar 12, 2013 at 10:19 PM, David Alves <da...@gmail.com> wrote:
> 
>> Hi Linsen
>> 
>>        Some of what you are saying like push down of ops like filter,
>> projection or partial aggregation below the storage engine scanner level,
>> or sub tree execution are actively being discussed in issues DRILL-13
>> (Strorage Engine Interface) and DRILL-15 (Hbase storage engine), your input
>> in these issues is most welcome.
>> 
>>        HBase in particular has the notion of
>> enpoints/coprocessors/filters that allow pushing this down easily (this is
>> also in line with what other parallel database over nosql implementations
>> like tajo do).
>>        A possible approach is to have the optimizer change the order of
>> the ops to place them below the storage engine scanner and let the SE impl
>> deal with it internally.
>> 
>>        There are also some other pieces missing at the moment AFAIK, like
>> a distributed metadata store, the drill daemons, wiring, etc.
>> 
>>        So in summary, you're absolutely right, and if you're particularly
>> interested in the HBase SE impl (as I am, for the moment) I'd be interested
>> in collaborating.
>> 
>> Best
>> David
>> 
>> 
>> On Mar 12, 2013, at 11:44 PM, Lisen Mu <im...@gmail.com> wrote:
>> 
>>> Hi David,
>>> 
>>> Very nice to see your effort on this.
>>> 
>>> Hi Jacques,
>>> 
>>> we are also extending drill prototype, to see if there is any chance to
>>> meet our production need. However, We find that implementing a performant
>>> HBase storage engine is a not so straight-forward work, and requires some
>>> workaround. The problem is in Scan interface.
>>> 
>>> In drill's physical plan model, ScanROP is in charge of table scan.
>> Storage
>>> engine provides output for a whole data source, a csv file for example.
>>> It's sufficient for input source like plain file, but for hbase, it's not
>>> very efficient, if not impossible, to let ScanROP retrieve a whole htable
>>> into drill. Storage engines like HBase should have some ablility to do
>> part
>>> of the DrQL query, like Filter, if a filter can be performed by
>> specifying
>>> startRowKey and endRowKey. Storage engine like mysql could do more, even
>>> Join.
>>> 
>>> Generally, it would be more clear if a ScanROP is mapped to a sub-DAG of
>>> logical plan DAG instead of a single Scan node in logical plan. If so,
>> more
>>> implementation-specific information would coupe into the plan
>> optimization
>>> & transformation phase. I guess that's the price to pay when optimization
>>> comes, or is there other way I failed to see?
>>> 
>>> Please correct me if anything is wrong.
>>> 
>>> thanks,
>>> 
>>> Lisen
>>> 
>>> 
>>> 
>>> On Wed, Mar 13, 2013 at 9:33 AM, David Alves <da...@gmail.com>
>> wrote:
>>> 
>>>> Hi Jacques
>>>> 
>>>>       I've submitted a fist pass patch to DRILL-15.
>>>>       I did this mostly because HBase will be my main target and
>> because
>>>> I wanted to get a feel of what would be a nice interface for DRILL-13.
>> Have
>>>> some thoughts that I will post soon.
>>>>       btw: I still can't assign issues to myself in JIRA, did you
>> forget
>>>> to add me as a contributor?
>>>> 
>>>> Best
>>>> David
>>>> 
>>>> On Mar 11, 2013, at 2:13 PM, Jacques Nadeau <ja...@apache.org> wrote:
>>>> 
>>>>> Hey David,
>>>>> 
>>>>> These sound good.  I've add you as a contributor on jira so you can
>>>> assign
>>>>> tasks to yourself.  I think 45 and 46 are good places to start.  15
>>>> depends
>>>>> on 13 and working on the two hand in hand would probably be a good
>> idea.
>>>>> Maybe we could do a design discussion on 15 and 13 here once you have
>>>> some
>>>>> time to focus on it.
>>>>> 
>>>>> Jacques
>>>>> 
>>>>> 
>>>>> On Mon, Mar 11, 2013 at 3:02 AM, David Alves <da...@gmail.com>
>>>> wrote:
>>>>> 
>>>>>> Hi All
>>>>>> 
>>>>>>      I have a new academic project for which I'd like to use drill
>>>>>> since none of the other parallel database over hadoop/nosql
>>>> implementations
>>>>>> fit just right.
>>>>>>      To this goal I've been tinkering with the prototype trying to
>>>> find
>>>>>> where I'd be most useful.
>>>>>> 
>>>>>>      Here's where I'd like to start, if you agree:
>>>>>>      - implement HBase storage engine (DRILL-15)
>>>>>>              - start with simple scanning an push down of
>>>>>> selection/projection
>>>>>>      - implement the LogicalPlanBuilder (DRILL-45)
>>>>>>      - setup coding style in the wiki (formatting/imports etc,
>>>> DRILL-46)
>>>>>>      - create builders for all logical plan elements/make logical
>>>> plans
>>>>>> immutable (no issue for this, I'd like to hear your thoughts first).
>>>>>> 
>>>>>>      Please let me know your thoughts, and if you agree please assign
>>>>>> the issues to me (it seems that I can't assign them myself).
>>>>>> 
>>>>>> Best
>>>>>> David Alves
>>>> 
>>>> 
>> 
>>

Re: contribution

Posted by Jacques Nadeau <ja...@apache.org>.

I'm working on a presentation that will better illustrate the layers.
 There are actually three key plans.  Thinking to date has been to break
the plans down into logical, physical and execution.  The third hasn't been
expressed well here and is entirely an internal domain to the execution
engine.  Following some classic methods: Logical expresses what we want to
do, Physical expresses how we want to do it (adding points of
parallelization but not specifying particular amounts of parallelization or
node by node assignments).  The execution engine is then responsible for
determining the amount of parallelization of a particular plan along with
system load (likely leveraging Berkeley's Sparrow work), task priority and
specific data locality information, building sub-dags to be assigned to
individual nodes and execute the plan.

So in the higher logical and physical levels, a single Scan and subsequent
ScanPOP should be okay...  (ScanROPs have a separate problems since they
ignore the level of separation we're planning for the real execution layer.
 This is the why the current ref impl turns a single Scan into potentially
a union of ScanROPs... not elegant but logically correct.)

The capabilities interface still needs to be defined for how a storage
engine reveals its logical capabilities and thus consumes part of the plan.

J


On Tue, Mar 12, 2013 at 10:19 PM, David Alves <da...@gmail.com> wrote:

> Hi Linsen
>
>         Some of what you are saying like push down of ops like filter,
> projection or partial aggregation below the storage engine scanner level,
> or sub tree execution are actively being discussed in issues DRILL-13
> (Strorage Engine Interface) and DRILL-15 (Hbase storage engine), your input
> in these issues is most welcome.
>
>         HBase in particular has the notion of
> enpoints/coprocessors/filters that allow pushing this down easily (this is
> also in line with what other parallel database over nosql implementations
> like tajo do).
>         A possible approach is to have the optimizer change the order of
> the ops to place them below the storage engine scanner and let the SE impl
> deal with it internally.
>
>         There are also some other pieces missing at the moment AFAIK, like
> a distributed metadata store, the drill daemons, wiring, etc.
>
>         So in summary, you're absolutely right, and if you're particularly
> interested in the HBase SE impl (as I am, for the moment) I'd be interested
> in collaborating.
>
> Best
> David
>
>
> On Mar 12, 2013, at 11:44 PM, Lisen Mu <im...@gmail.com> wrote:
>
> > Hi David,
> >
> > Very nice to see your effort on this.
> >
> > Hi Jacques,
> >
> > we are also extending drill prototype, to see if there is any chance to
> > meet our production need. However, We find that implementing a performant
> > HBase storage engine is a not so straight-forward work, and requires some
> > workaround. The problem is in Scan interface.
> >
> > In drill's physical plan model, ScanROP is in charge of table scan.
> Storage
> > engine provides output for a whole data source, a csv file for example.
> > It's sufficient for input source like plain file, but for hbase, it's not
> > very efficient, if not impossible, to let ScanROP retrieve a whole htable
> > into drill. Storage engines like HBase should have some ablility to do
> part
> > of the DrQL query, like Filter, if a filter can be performed by
> specifying
> > startRowKey and endRowKey. Storage engine like mysql could do more, even
> > Join.
> >
> > Generally, it would be more clear if a ScanROP is mapped to a sub-DAG of
> > logical plan DAG instead of a single Scan node in logical plan. If so,
> more
> > implementation-specific information would coupe into the plan
> optimization
> > & transformation phase. I guess that's the price to pay when optimization
> > comes, or is there other way I failed to see?
> >
> > Please correct me if anything is wrong.
> >
> > thanks,
> >
> > Lisen
> >
> >
> >
> > On Wed, Mar 13, 2013 at 9:33 AM, David Alves <da...@gmail.com>
> wrote:
> >
> >> Hi Jacques
> >>
> >>        I've submitted a fist pass patch to DRILL-15.
> >>        I did this mostly because HBase will be my main target and
> because
> >> I wanted to get a feel of what would be a nice interface for DRILL-13.
> Have
> >> some thoughts that I will post soon.
> >>        btw: I still can't assign issues to myself in JIRA, did you
> forget
> >> to add me as a contributor?
> >>
> >> Best
> >> David
> >>
> >> On Mar 11, 2013, at 2:13 PM, Jacques Nadeau <ja...@apache.org> wrote:
> >>
> >>> Hey David,
> >>>
> >>> These sound good.  I've add you as a contributor on jira so you can
> >> assign
> >>> tasks to yourself.  I think 45 and 46 are good places to start.  15
> >> depends
> >>> on 13 and working on the two hand in hand would probably be a good
> idea.
> >>> Maybe we could do a design discussion on 15 and 13 here once you have
> >> some
> >>> time to focus on it.
> >>>
> >>> Jacques
> >>>
> >>>
> >>> On Mon, Mar 11, 2013 at 3:02 AM, David Alves <da...@gmail.com>
> >> wrote:
> >>>
> >>>> Hi All
> >>>>
> >>>>       I have a new academic project for which I'd like to use drill
> >>>> since none of the other parallel database over hadoop/nosql
> >> implementations
> >>>> fit just right.
> >>>>       To this goal I've been tinkering with the prototype trying to
> >> find
> >>>> where I'd be most useful.
> >>>>
> >>>>       Here's where I'd like to start, if you agree:
> >>>>       - implement HBase storage engine (DRILL-15)
> >>>>               - start with simple scanning an push down of
> >>>> selection/projection
> >>>>       - implement the LogicalPlanBuilder (DRILL-45)
> >>>>       - setup coding style in the wiki (formatting/imports etc,
> >> DRILL-46)
> >>>>       - create builders for all logical plan elements/make logical
> >> plans
> >>>> immutable (no issue for this, I'd like to hear your thoughts first).
> >>>>
> >>>>       Please let me know your thoughts, and if you agree please assign
> >>>> the issues to me (it seems that I can't assign them myself).
> >>>>
> >>>> Best
> >>>> David Alves
> >>
> >>
>
>

Re: contribution

Posted by Lisen Mu <im...@gmail.com>.

David,

That's great, you are making point more clear in DRILL-13. Would come back
there when I get more clue on this.






On Wed, Mar 13, 2013 at 1:19 PM, David Alves <da...@gmail.com> wrote:

> Hi Linsen
>
>         Some of what you are saying like push down of ops like filter,
> projection or partial aggregation below the storage engine scanner level,
> or sub tree execution are actively being discussed in issues DRILL-13
> (Strorage Engine Interface) and DRILL-15 (Hbase storage engine), your input
> in these issues is most welcome.
>
>         HBase in particular has the notion of
> enpoints/coprocessors/filters that allow pushing this down easily (this is
> also in line with what other parallel database over nosql implementations
> like tajo do).
>         A possible approach is to have the optimizer change the order of
> the ops to place them below the storage engine scanner and let the SE impl
> deal with it internally.
>
>         There are also some other pieces missing at the moment AFAIK, like
> a distributed metadata store, the drill daemons, wiring, etc.
>
>         So in summary, you're absolutely right, and if you're particularly
> interested in the HBase SE impl (as I am, for the moment) I'd be interested
> in collaborating.
>
> Best
> David
>
>
> On Mar 12, 2013, at 11:44 PM, Lisen Mu <im...@gmail.com> wrote:
>
> > Hi David,
> >
> > Very nice to see your effort on this.
> >
> > Hi Jacques,
> >
> > we are also extending drill prototype, to see if there is any chance to
> > meet our production need. However, We find that implementing a performant
> > HBase storage engine is a not so straight-forward work, and requires some
> > workaround. The problem is in Scan interface.
> >
> > In drill's physical plan model, ScanROP is in charge of table scan.
> Storage
> > engine provides output for a whole data source, a csv file for example.
> > It's sufficient for input source like plain file, but for hbase, it's not
> > very efficient, if not impossible, to let ScanROP retrieve a whole htable
> > into drill. Storage engines like HBase should have some ablility to do
> part
> > of the DrQL query, like Filter, if a filter can be performed by
> specifying
> > startRowKey and endRowKey. Storage engine like mysql could do more, even
> > Join.
> >
> > Generally, it would be more clear if a ScanROP is mapped to a sub-DAG of
> > logical plan DAG instead of a single Scan node in logical plan. If so,
> more
> > implementation-specific information would coupe into the plan
> optimization
> > & transformation phase. I guess that's the price to pay when optimization
> > comes, or is there other way I failed to see?
> >
> > Please correct me if anything is wrong.
> >
> > thanks,
> >
> > Lisen
> >
> >
> >
> > On Wed, Mar 13, 2013 at 9:33 AM, David Alves <da...@gmail.com>
> wrote:
> >
> >> Hi Jacques
> >>
> >>        I've submitted a fist pass patch to DRILL-15.
> >>        I did this mostly because HBase will be my main target and
> because
> >> I wanted to get a feel of what would be a nice interface for DRILL-13.
> Have
> >> some thoughts that I will post soon.
> >>        btw: I still can't assign issues to myself in JIRA, did you
> forget
> >> to add me as a contributor?
> >>
> >> Best
> >> David
> >>
> >> On Mar 11, 2013, at 2:13 PM, Jacques Nadeau <ja...@apache.org> wrote:
> >>
> >>> Hey David,
> >>>
> >>> These sound good.  I've add you as a contributor on jira so you can
> >> assign
> >>> tasks to yourself.  I think 45 and 46 are good places to start.  15
> >> depends
> >>> on 13 and working on the two hand in hand would probably be a good
> idea.
> >>> Maybe we could do a design discussion on 15 and 13 here once you have
> >> some
> >>> time to focus on it.
> >>>
> >>> Jacques
> >>>
> >>>
> >>> On Mon, Mar 11, 2013 at 3:02 AM, David Alves <da...@gmail.com>
> >> wrote:
> >>>
> >>>> Hi All
> >>>>
> >>>>       I have a new academic project for which I'd like to use drill
> >>>> since none of the other parallel database over hadoop/nosql
> >> implementations
> >>>> fit just right.
> >>>>       To this goal I've been tinkering with the prototype trying to
> >> find
> >>>> where I'd be most useful.
> >>>>
> >>>>       Here's where I'd like to start, if you agree:
> >>>>       - implement HBase storage engine (DRILL-15)
> >>>>               - start with simple scanning an push down of
> >>>> selection/projection
> >>>>       - implement the LogicalPlanBuilder (DRILL-45)
> >>>>       - setup coding style in the wiki (formatting/imports etc,
> >> DRILL-46)
> >>>>       - create builders for all logical plan elements/make logical
> >> plans
> >>>> immutable (no issue for this, I'd like to hear your thoughts first).
> >>>>
> >>>>       Please let me know your thoughts, and if you agree please assign
> >>>> the issues to me (it seems that I can't assign them myself).
> >>>>
> >>>> Best
> >>>> David Alves
> >>
> >>
>
>

Re: contribution

Posted by David Alves <da...@gmail.com>.

Hi Linsen

	Some of what you are saying like push down of ops like filter, projection or partial aggregation below the storage engine scanner level, or sub tree execution are actively being discussed in issues DRILL-13 (Strorage Engine Interface) and DRILL-15 (Hbase storage engine), your input in these issues is most welcome.

	HBase in particular has the notion of enpoints/coprocessors/filters that allow pushing this down easily (this is also in line with what other parallel database over nosql implementations like tajo do).
	A possible approach is to have the optimizer change the order of the ops to place them below the storage engine scanner and let the SE impl deal with it internally.

	There are also some other pieces missing at the moment AFAIK, like a distributed metadata store, the drill daemons, wiring, etc.

	So in summary, you're absolutely right, and if you're particularly interested in the HBase SE impl (as I am, for the moment) I'd be interested in collaborating.

Best
David

	
On Mar 12, 2013, at 11:44 PM, Lisen Mu <im...@gmail.com> wrote:

> Hi David,
> 
> Very nice to see your effort on this.
> 
> Hi Jacques,
> 
> we are also extending drill prototype, to see if there is any chance to
> meet our production need. However, We find that implementing a performant
> HBase storage engine is a not so straight-forward work, and requires some
> workaround. The problem is in Scan interface.
> 
> In drill's physical plan model, ScanROP is in charge of table scan. Storage
> engine provides output for a whole data source, a csv file for example.
> It's sufficient for input source like plain file, but for hbase, it's not
> very efficient, if not impossible, to let ScanROP retrieve a whole htable
> into drill. Storage engines like HBase should have some ablility to do part
> of the DrQL query, like Filter, if a filter can be performed by specifying
> startRowKey and endRowKey. Storage engine like mysql could do more, even
> Join.
> 
> Generally, it would be more clear if a ScanROP is mapped to a sub-DAG of
> logical plan DAG instead of a single Scan node in logical plan. If so, more
> implementation-specific information would coupe into the plan optimization
> & transformation phase. I guess that's the price to pay when optimization
> comes, or is there other way I failed to see?
> 
> Please correct me if anything is wrong.
> 
> thanks,
> 
> Lisen
> 
> 
> 
> On Wed, Mar 13, 2013 at 9:33 AM, David Alves <da...@gmail.com> wrote:
> 
>> Hi Jacques
>> 
>>        I've submitted a fist pass patch to DRILL-15.
>>        I did this mostly because HBase will be my main target and because
>> I wanted to get a feel of what would be a nice interface for DRILL-13. Have
>> some thoughts that I will post soon.
>>        btw: I still can't assign issues to myself in JIRA, did you forget
>> to add me as a contributor?
>> 
>> Best
>> David
>> 
>> On Mar 11, 2013, at 2:13 PM, Jacques Nadeau <ja...@apache.org> wrote:
>> 
>>> Hey David,
>>> 
>>> These sound good.  I've add you as a contributor on jira so you can
>> assign
>>> tasks to yourself.  I think 45 and 46 are good places to start.  15
>> depends
>>> on 13 and working on the two hand in hand would probably be a good idea.
>>> Maybe we could do a design discussion on 15 and 13 here once you have
>> some
>>> time to focus on it.
>>> 
>>> Jacques
>>> 
>>> 
>>> On Mon, Mar 11, 2013 at 3:02 AM, David Alves <da...@gmail.com>
>> wrote:
>>> 
>>>> Hi All
>>>> 
>>>>       I have a new academic project for which I'd like to use drill
>>>> since none of the other parallel database over hadoop/nosql
>> implementations
>>>> fit just right.
>>>>       To this goal I've been tinkering with the prototype trying to
>> find
>>>> where I'd be most useful.
>>>> 
>>>>       Here's where I'd like to start, if you agree:
>>>>       - implement HBase storage engine (DRILL-15)
>>>>               - start with simple scanning an push down of
>>>> selection/projection
>>>>       - implement the LogicalPlanBuilder (DRILL-45)
>>>>       - setup coding style in the wiki (formatting/imports etc,
>> DRILL-46)
>>>>       - create builders for all logical plan elements/make logical
>> plans
>>>> immutable (no issue for this, I'd like to hear your thoughts first).
>>>> 
>>>>       Please let me know your thoughts, and if you agree please assign
>>>> the issues to me (it seems that I can't assign them myself).
>>>> 
>>>> Best
>>>> David Alves
>> 
>>

Re: contribution

Posted by Lisen Mu <im...@gmail.com>.

Hi David,

Very nice to see your effort on this.

Hi Jacques,

we are also extending drill prototype, to see if there is any chance to
meet our production need. However, We find that implementing a performant
HBase storage engine is a not so straight-forward work, and requires some
workaround. The problem is in Scan interface.

In drill's physical plan model, ScanROP is in charge of table scan. Storage
engine provides output for a whole data source, a csv file for example.
It's sufficient for input source like plain file, but for hbase, it's not
very efficient, if not impossible, to let ScanROP retrieve a whole htable
into drill. Storage engines like HBase should have some ablility to do part
of the DrQL query, like Filter, if a filter can be performed by specifying
startRowKey and endRowKey. Storage engine like mysql could do more, even
Join.

Generally, it would be more clear if a ScanROP is mapped to a sub-DAG of
logical plan DAG instead of a single Scan node in logical plan. If so, more
implementation-specific information would coupe into the plan optimization
& transformation phase. I guess that's the price to pay when optimization
comes, or is there other way I failed to see?

Please correct me if anything is wrong.

thanks,

Lisen

On Wed, Mar 13, 2013 at 9:33 AM, David Alves <da...@gmail.com> wrote:

> Hi Jacques
>
>         I've submitted a fist pass patch to DRILL-15.
>         I did this mostly because HBase will be my main target and because
> I wanted to get a feel of what would be a nice interface for DRILL-13. Have
> some thoughts that I will post soon.
>         btw: I still can't assign issues to myself in JIRA, did you forget
> to add me as a contributor?
>
> Best
> David
>
> On Mar 11, 2013, at 2:13 PM, Jacques Nadeau <ja...@apache.org> wrote:
>
> > Hey David,
> >
> > These sound good.  I've add you as a contributor on jira so you can
> assign
> > tasks to yourself.  I think 45 and 46 are good places to start.  15
> depends
> > on 13 and working on the two hand in hand would probably be a good idea.
> > Maybe we could do a design discussion on 15 and 13 here once you have
> some
> > time to focus on it.
> >
> > Jacques
> >
> >
> > On Mon, Mar 11, 2013 at 3:02 AM, David Alves <da...@gmail.com>
> wrote:
> >
> >> Hi All
> >>
> >>        I have a new academic project for which I'd like to use drill
> >> since none of the other parallel database over hadoop/nosql
> implementations
> >> fit just right.
> >>        To this goal I've been tinkering with the prototype trying to
> find
> >> where I'd be most useful.
> >>
> >>        Here's where I'd like to start, if you agree:
> >>        - implement HBase storage engine (DRILL-15)
> >>                - start with simple scanning an push down of
> >> selection/projection
> >>        - implement the LogicalPlanBuilder (DRILL-45)
> >>        - setup coding style in the wiki (formatting/imports etc,
> DRILL-46)
> >>        - create builders for all logical plan elements/make logical
> plans
> >> immutable (no issue for this, I'd like to hear your thoughts first).
> >>
> >>        Please let me know your thoughts, and if you agree please assign
> >> the issues to me (it seems that I can't assign them myself).
> >>
> >> Best
> >> David Alves
>
>

Re: contribution

Posted by Jacques Nadeau <ja...@apache.org>.

Weird, I re-added you.  Hopefully assignments should work for you now.

I will have a look at DRILL-15.  As you mention there, the
the serialization deserialization and other interfaces provided in the
reference impl are very different than what we've sketching out regarding
the full impl.

J


On Tue, Mar 12, 2013 at 6:33 PM, David Alves <da...@gmail.com> wrote:

> Hi Jacques
>
>         I've submitted a fist pass patch to DRILL-15.
>         I did this mostly because HBase will be my main target and because
> I wanted to get a feel of what would be a nice interface for DRILL-13. Have
> some thoughts that I will post soon.
>         btw: I still can't assign issues to myself in JIRA, did you forget
> to add me as a contributor?
>
> Best
> David
>
> On Mar 11, 2013, at 2:13 PM, Jacques Nadeau <ja...@apache.org> wrote:
>
> > Hey David,
> >
> > These sound good.  I've add you as a contributor on jira so you can
> assign
> > tasks to yourself.  I think 45 and 46 are good places to start.  15
> depends
> > on 13 and working on the two hand in hand would probably be a good idea.
> > Maybe we could do a design discussion on 15 and 13 here once you have
> some
> > time to focus on it.
> >
> > Jacques
> >
> >
> > On Mon, Mar 11, 2013 at 3:02 AM, David Alves <da...@gmail.com>
> wrote:
> >
> >> Hi All
> >>
> >>        I have a new academic project for which I'd like to use drill
> >> since none of the other parallel database over hadoop/nosql
> implementations
> >> fit just right.
> >>        To this goal I've been tinkering with the prototype trying to
> find
> >> where I'd be most useful.
> >>
> >>        Here's where I'd like to start, if you agree:
> >>        - implement HBase storage engine (DRILL-15)
> >>                - start with simple scanning an push down of
> >> selection/projection
> >>        - implement the LogicalPlanBuilder (DRILL-45)
> >>        - setup coding style in the wiki (formatting/imports etc,
> DRILL-46)
> >>        - create builders for all logical plan elements/make logical
> plans
> >> immutable (no issue for this, I'd like to hear your thoughts first).
> >>
> >>        Please let me know your thoughts, and if you agree please assign
> >> the issues to me (it seems that I can't assign them myself).
> >>
> >> Best
> >> David Alves
>
>

Re: contribution

Posted by David Alves <da...@gmail.com>.

Hi Jacques

	I've submitted a fist pass patch to DRILL-15.
	I did this mostly because HBase will be my main target and because I wanted to get a feel of what would be a nice interface for DRILL-13. Have some thoughts that I will post soon.
	btw: I still can't assign issues to myself in JIRA, did you forget to add me as a contributor?

Best
David

On Mar 11, 2013, at 2:13 PM, Jacques Nadeau <ja...@apache.org> wrote:

> Hey David,
> 
> These sound good.  I've add you as a contributor on jira so you can assign
> tasks to yourself.  I think 45 and 46 are good places to start.  15 depends
> on 13 and working on the two hand in hand would probably be a good idea.
> Maybe we could do a design discussion on 15 and 13 here once you have some
> time to focus on it.
> 
> Jacques
> 
> 
> On Mon, Mar 11, 2013 at 3:02 AM, David Alves <da...@gmail.com> wrote:
> 
>> Hi All
>> 
>>        I have a new academic project for which I'd like to use drill
>> since none of the other parallel database over hadoop/nosql implementations
>> fit just right.
>>        To this goal I've been tinkering with the prototype trying to find
>> where I'd be most useful.
>> 
>>        Here's where I'd like to start, if you agree:
>>        - implement HBase storage engine (DRILL-15)
>>                - start with simple scanning an push down of
>> selection/projection
>>        - implement the LogicalPlanBuilder (DRILL-45)
>>        - setup coding style in the wiki (formatting/imports etc, DRILL-46)
>>        - create builders for all logical plan elements/make logical plans
>> immutable (no issue for this, I'd like to hear your thoughts first).
>> 
>>        Please let me know your thoughts, and if you agree please assign
>> the issues to me (it seems that I can't assign them myself).
>> 
>> Best
>> David Alves

Re: contribution

Posted by Jacques Nadeau <ja...@apache.org>.

Hey David,

These sound good.  I've add you as a contributor on jira so you can assign
tasks to yourself.  I think 45 and 46 are good places to start.  15 depends
on 13 and working on the two hand in hand would probably be a good idea.
 Maybe we could do a design discussion on 15 and 13 here once you have some
time to focus on it.

Jacques


On Mon, Mar 11, 2013 at 3:02 AM, David Alves <da...@gmail.com> wrote:

> Hi All
>
>         I have a new academic project for which I'd like to use drill
> since none of the other parallel database over hadoop/nosql implementations
> fit just right.
>         To this goal I've been tinkering with the prototype trying to find
> where I'd be most useful.
>
>         Here's where I'd like to start, if you agree:
>         - implement HBase storage engine (DRILL-15)
>                 - start with simple scanning an push down of
> selection/projection
>         - implement the LogicalPlanBuilder (DRILL-45)
>         - setup coding style in the wiki (formatting/imports etc, DRILL-46)
>         - create builders for all logical plan elements/make logical plans
> immutable (no issue for this, I'd like to hear your thoughts first).
>
>         Please let me know your thoughts, and if you agree please assign
> the issues to me (it seems that I can't assign them myself).
>
> Best
> David Alves