Posted to user@drill.apache.org by devansh kumar <de...@yahoo.com> on 2013/04/04 12:27:11 UTC

Basic queries regarding Apache Drill working

Hi,

I am new to Apache Drill and am trying to understand how it works, but I have a few questions.
Can anyone help me understand these things?

1.
I am trying to understand whether the execution engine is going to break up the data.
What will happen if I try to run an aggregation operation like AVERAGE?
How will that work?
I have seen operations such as SUM and COUNT.
What will the query execution tree look like in the case of an AVERAGE?

2.
Is the resource model optimized compared to MapReduce?

Regards,
Devansh Rusia.

Re: Basic queries regarding Apache Drill working

Posted by Jacques Nadeau <ja...@apache.org>.

Oops, meant to include a reference as an example of streaming algorithms:
https://github.com/clearspring/stream-lib
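
For readers who want a feel for what a parallel approximation of the median can look like, here is a minimal Python sketch. It is not taken from stream-lib (a Java library) and does not reproduce its algorithms; it swaps in a simple fixed-bucket histogram, which assumes the value range [lo, hi) is known in advance. The function names and sample data are hypothetical. Each node summarizes its values in one pass, the summaries are merged by adding bucket counts, and the median is read from the merged counts.

```python
from typing import Iterable, List


def local_histogram(values: Iterable[float], lo: float, hi: float,
                    buckets: int) -> List[int]:
    """One pass over a node's values; memory is O(buckets), not O(rows)."""
    counts = [0] * buckets
    width = (hi - lo) / buckets
    for v in values:
        idx = int((v - lo) / width)
        # Clamp so out-of-range values land in the first or last bucket.
        counts[min(max(idx, 0), buckets - 1)] += 1
    return counts


def merge_histograms(a: List[int], b: List[int]) -> List[int]:
    """Merging is just element-wise addition, so it can happen at any level."""
    return [x + y for x, y in zip(a, b)]


def approx_median(counts: List[int], lo: float, hi: float) -> float:
    """Return the midpoint of the bucket that contains the middle row."""
    total = sum(counts)
    if total == 0:
        raise ValueError("median of an empty data set is undefined")
    width = (hi - lo) / len(counts)
    target = (total + 1) // 2  # rank of the (lower) median
    seen = 0
    for i, c in enumerate(counts):
        seen += c
        if seen >= target:
            return lo + (i + 0.5) * width
    raise AssertionError("unreachable: counts sum to total")


if __name__ == "__main__":
    node1 = local_histogram([1, 5, 9, 50], 0.0, 100.0, 100)
    node2 = local_histogram([7, 8, 90], 0.0, 100.0, 100)
    merged = merge_histograms(node1, node2)
    print(approx_median(merged, 0.0, 100.0))  # 8.5 (the exact median is 8)
```

The error is bounded by the bucket width, so more buckets buy more accuracy at the cost of a larger summary. Production streaming quantile algorithms avoid the fixed-range assumption made here.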



On Fri, Apr 5, 2013 at 8:34 AM, Jacques Nadeau <ja...@apache.org> wrote:

> The current thinking is that there will be an approximate query flag.
>  This will be useful in situations where parallel approximations can be
> made.  The simplest example is you want a top 10 group by attr1.  You can
> do a local top N group by attr1 and then merge those results.  While not
> exactly right, it can be statistically accurate based on the right choice
> of N.  There is also parallel approximations for other things such as
> median using streaming algorithms.  The goal is for Drill to be able to use
> these approximation algorithms in a processing tree for more queries.  In
> the case that a user needs exact results, full shuffle/aggregations will
> still need to be done.  They will still benefit from avoiding the various
> MapReduce barriers and requirements for persistence between stages.
>
> J

RE: Basic queries regarding Apache Drill working

Posted by Andrew Brust <an...@bluebadgeinsights.com>.

And then really smart consultants will explain to them what that really does, and they'll freak a little :)

From: jacques.drill@gmail.com [mailto:jacques.drill@gmail.com] On Behalf Of Jacques Nadeau
Sent: Friday, April 5, 2013 12:08 PM
To: Andrew Brust
Cc: drill-user@incubator.apache.org; devansh kumar; ted.dunning@gmail.com
Subject: Re: Basic queries regarding Apache Drill working

Agreed.  Trusting statistics is always a little scary.  My gut is that whatever the Drill default is, admins will set the approximate flag on by default and analysts won't even realize it most of the time... They'll just get faster answers and be happy.

Re: Basic queries regarding Apache Drill working

Posted by Jacques Nadeau <ja...@apache.org>.

Agreed.  Trusting statistics is always a little scary.  My gut is that
whatever the Drill default is, admins will set the approximate flag on by
default and analysts won't even realize it most of the time... They'll just
get faster answers and be happy.



On Fri, Apr 5, 2013 at 8:53 AM, Andrew Brust <
andrew.brust@bluebadgeinsights.com> wrote:

>  OK, thank you for that explanation.  The whole notion of “not exactly
> right” scares me a bit, but I do see the utility in the approach and the
> point that over a large enough dataset, the statistical accuracy can still
> be there.  Also agreed that a one-pass process beats a two-pass with
> intermediate persistence.

RE: Basic queries regarding Apache Drill working

Posted by Andrew Brust <an...@bluebadgeinsights.com>.

OK, thank you for that explanation.  The whole notion of "not exactly right" scares me a bit, but I do see the utility in the approach and the point that over a large enough dataset, the statistical accuracy can still be there.  Also agreed that a one-pass process beats a two-pass with intermediate persistence.

From: jacques.drill@gmail.com [mailto:jacques.drill@gmail.com] On Behalf Of Jacques Nadeau
Sent: Friday, April 5, 2013 11:34 AM
To: drill-user@incubator.apache.org; devansh kumar
Cc: Andrew Brust; ted.dunning@gmail.com
Subject: Re: Basic queries regarding Apache Drill working

The current thinking is that there will be an approximate query flag.  This will be useful in situations where parallel approximations can be made.  The simplest example is you want a top 10 group by attr1.  You can do a local top N group by attr1 and then merge those results.  While not exactly right, it can be statistically accurate based on the right choice of N.  There is also parallel approximations for other things such as median using streaming algorithms.  The goal is for Drill to be able to use these approximation algorithms in a processing tree for more queries.  In the case that a user needs exact results, full shuffle/aggregations will still need to be done.  They will still benefit from avoiding the various MapReduce barriers and requirements for persistence between stages.

J

Re: Basic queries regarding Apache Drill working

Posted by Jacques Nadeau <ja...@apache.org>.

The current thinking is that there will be an approximate query flag.  This
will be useful in situations where parallel approximations can be made.
The simplest example is you want a top 10 group by attr1.  You can do a
local top N group by attr1 and then merge those results.  While not exactly
right, it can be statistically accurate based on the right choice of N.
There are also parallel approximations for other things such as median
using streaming algorithms.  The goal is for Drill to be able to use these
approximation algorithms in a processing tree for more queries.  In the
case that a user needs exact results, full shuffle/aggregations will still
need to be done.  They will still benefit from avoiding the various
MapReduce barriers and requirements for persistence between stages.
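
The local top N and merge idea can be sketched in a few lines of Python. This is only an illustration of the approach described above, not Drill code: the function names, the choice to rank groups by row count, and the sample data are all assumptions.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def local_top_n(values: Iterable[str], n: int) -> Counter:
    """Count attr1 values on one node and keep only the n most frequent."""
    return Counter(dict(Counter(values).most_common(n)))


def merged_top_k(local_results: List[Counter], k: int) -> List[Tuple[str, int]]:
    """Add up the surviving local counts and return the k largest groups."""
    total: Counter = Counter()
    for partial in local_results:
        total.update(partial)  # Counter.update adds counts
    return total.most_common(k)


if __name__ == "__main__":
    node1 = ["a"] * 5 + ["b"] * 3 + ["c"] * 2
    node2 = ["a"] * 4 + ["c"] * 3 + ["b"] * 2
    partials = [local_top_n(node1, 2), local_top_n(node2, 2)]
    print(merged_top_k(partials, 1))  # [('a', 9)]
```

In this toy data the winner and its count are exact, but the merged counts for "b" and "c" come out as 3 each instead of the true 5, because each was cut from one node's local list. That is the "not exactly right" part, and it is why N is normally chosen larger than the number of results the query asks for.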

J

On Thu, Apr 4, 2013 at 10:31 PM, devansh kumar <de...@yahoo.com>wrote:

> Hi,
>
> I understood what you wanted to say of using SUM and COUNT for calculating
> AVERAGE.
> But as i understand this will work very well with Distributed
> operations..... what about operations like Median.
>
> Also i wanted to ask how the query will be broken up in
> the execution engine.
> I have gone through the Apache drill documentation and also Google Dremel
> paper, and i am still confused that how multiple level of aggregation
> will be created inside one tree.
>
> Thanks!

Re: Basic queries regarding Apache Drill working

Posted by devansh kumar <de...@yahoo.com>.

Hi,

I understood what you said about using SUM and COUNT to calculate AVERAGE.
That works well with distributed operations, as I understand it, but what about operations like median?

Also, I wanted to ask how the query will be broken up in the execution engine.
I have gone through the Apache Drill documentation and the Google Dremel paper, and I am still confused about how multiple levels of aggregation
will be created inside one tree.

Thanks!

________________________________
 From: devansh kumar <de...@yahoo.com>
To: Andrew Brust <an...@bluebadgeinsights.com>; "drill-user@incubator.apache.org" <dr...@incubator.apache.org>; "ted.dunning@gmail.com" <te...@gmail.com> 
Sent: Friday, April 5, 2013 10:18 AM
Subject: Re: Basic queries regarding Apache Drill working

Hi,

As Andrew asked, how will average work without an operation of Reduce present. 
Can you explain more on how will the data be aggregated?

Re: Basic queries regarding Apache Drill working

Posted by devansh kumar <de...@yahoo.com>.

Hi,

As Andrew asked, how will an average work without a Reduce operation?
Can you explain more about how the data will be aggregated?

________________________________
 From: Andrew Brust <an...@bluebadgeinsights.com>
To: "drill-user@incubator.apache.org" <dr...@incubator.apache.org>; devansh kumar <de...@yahoo.com> 
Sent: Thursday, April 4, 2013 8:00 PM
Subject: RE: Basic queries regarding Apache Drill working

Still not sure I follow (and pardon what must be a very rudimentary misunderstanding on my part) how you get an average across a data set if the data is split across nodes.  With MapReduce, the reducer can get it because all data for a given key is kept to one node.  How would this work with Drill?

RE: Basic queries regarding Apache Drill working

Posted by Andrew Brust <an...@bluebadgeinsights.com>.

Still not sure I follow (and pardon what must be a very rudimentary misunderstanding on my part) how you get an average across a data set if the data is split across nodes.  With MapReduce, the reducer can get it because all data for a given key is kept to one node.  How would this work with Drill?

-----Original Message-----
From: Ted Dunning [mailto:ted.dunning@gmail.com] 
Sent: Thursday, April 4, 2013 9:27 AM
To: drill-user@incubator.apache.org; devansh kumar
Subject: Re: Basic queries regarding Apache Drill working

On Thu, Apr 4, 2013 at 12:27 PM, devansh kumar <de...@yahoo.com>wrote:

> Hi,
>
> I am new and am trying to understand how Apache Drill  works but i 
> have a few queries.
> Can anyone help me understand these things?
>
> 1.
> I am trying to understand if the execution engine is going to break up 
> the data.
>

Normally the data will already have been broken up across a cluster.

> What will happen if i am trying to an aggregation operation like (AVERAGE).
> How will that work??
>

Yes.

> I have seen operations as SUM and COUNT.
> How will the Query execution tree look like in case of an AVERAGE
>

It will look exactly like a SUM or COUNT except that two numbers will be accumulated instead of one.

> 2.
> Does the Resource model is optimized when compared to MapReduce.
>

Yes.  This will happen because multiple levels of aggregation can be done in one tree without the barrier between map and reduce imposed by the MapReduce structure.

Re: Basic queries regarding Apache Drill working

Posted by Ted Dunning <te...@gmail.com>.

On Thu, Apr 4, 2013 at 12:27 PM, devansh kumar <de...@yahoo.com>wrote:

> Hi,
>
> I am new to Apache Drill and am trying to understand how it works, but I have a
> few questions.
> Can anyone help me understand these things?
>
> 1.
> I am trying to understand whether the execution engine is going to break up the
> data.
>

Normally the data will already have been broken up across a cluster.

> What will happen if I try to run an aggregation operation like AVERAGE?
> How will that work?
>

Yes.

> I have seen operations such as SUM and COUNT.
> What will the query execution tree look like in the case of an AVERAGE?
>

It will look exactly like a SUM or COUNT except that two numbers will be
accumulated instead of one.
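
To make the "two numbers" point concrete, here is a minimal Python sketch. It is not Drill code; the function names and the in-memory lists standing in for cluster nodes are assumptions for illustration. Each node reduces its local rows to a (sum, count) pair, the pairs are merged, and the division happens once at the end.

```python
from typing import Iterable, List, Tuple

Partial = Tuple[float, int]  # (running sum, running count)


def local_partial(values: Iterable[float]) -> Partial:
    """Reduce one node's rows to a (sum, count) pair."""
    total, count = 0.0, 0
    for v in values:
        total += v
        count += 1
    return total, count


def merge(a: Partial, b: Partial) -> Partial:
    """Combine two partials; associative and commutative, so order is free."""
    return a[0] + b[0], a[1] + b[1]


def finalize(p: Partial) -> float:
    """Turn the merged pair into the average."""
    total, count = p
    if count == 0:
        raise ValueError("average of an empty data set is undefined")
    return total / count


def distributed_average(partitions: List[List[float]]) -> float:
    """Merge every node's partial, then divide once."""
    merged: Partial = (0.0, 0)
    for part in partitions:
        merged = merge(merged, local_partial(part))
    return finalize(merged)


if __name__ == "__main__":
    nodes = [[1.0, 2.0, 3.0], [10.0], [4.0, 4.0]]
    print(distributed_average(nodes))  # 4.0
```

Carrying the count alongside the sum is what keeps the result exact. Averaging the three per-node averages (2, 10 and 4) would give about 5.33, because it ignores how many rows each node holds.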

> 2.
> Is the resource model optimized compared to MapReduce?
>

Yes.  This will happen because multiple levels of aggregation can be done
in one tree without the barrier between map and reduce imposed by the
MapReduce structure.
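
As a rough illustration of multiple levels of aggregation in one tree, the Python sketch below merges per-key (sum, count) partials pairwise, level by level, until a single root remains. The fan-in of two, the "east"/"west" keys, and the in-process function calls standing in for network transfers are all assumptions; this is not Drill's actual execution plan. The point is that each level consumes the output of the level below directly, and no node ever needs all the raw rows for a key.

```python
from typing import Dict, List, Tuple

Partial = Dict[str, Tuple[float, int]]  # group key -> (sum, count)


def leaf(rows: List[Tuple[str, float]]) -> Partial:
    """Leaf level: aggregate the rows stored on one node."""
    out: Partial = {}
    for key, value in rows:
        s, c = out.get(key, (0.0, 0))
        out[key] = (s + value, c + 1)
    return out


def combine(a: Partial, b: Partial) -> Partial:
    """Intermediate level: merge two partial results key by key."""
    out = dict(a)
    for key, (s, c) in b.items():
        s0, c0 = out.get(key, (0.0, 0))
        out[key] = (s0 + s, c0 + c)
    return out


def aggregate_tree(partials: List[Partial]) -> Partial:
    """Merge partials pairwise, level by level, until one root remains."""
    if not partials:
        return {}
    level = partials
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level), 2):
            pair = level[i:i + 2]
            nxt.append(combine(pair[0], pair[1]) if len(pair) == 2 else pair[0])
        level = nxt
    return level[0]


def averages(root: Partial) -> Dict[str, float]:
    """Root level: one division per group."""
    return {k: s / c for k, (s, c) in root.items()}


if __name__ == "__main__":
    node_rows = [
        [("east", 10.0), ("west", 2.0)],
        [("east", 20.0)],
        [("west", 4.0), ("west", 6.0)],
    ]
    root = aggregate_tree([leaf(r) for r in node_rows])
    print(averages(root))  # {'east': 15.0, 'west': 4.0}
```

Because `combine` is associative, the tree can have any depth or fan-in without changing the answer, which is what lets the levels be chained without a barrier between them.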