Posted to user@pig.apache.org by Chuck Lan <cl...@modeln.com> on 2008/09/02 23:15:35 UTC

How are GROUP and COGROUP functions implemented?

Hi,

 

How are the GROUP and COGROUP functions implemented?  What's their
efficiency?

 

Thanks,

Chuck


RE: How are GROUP and COGROUP functions implemented?

Posted by Chuck Lan <cl...@modeln.com>.
The 9 minutes is actually from having configured hadoop with 7 parallel
reduce tasks.  My guess is that the poor performance is because the reduce
phase is still generating a lot of lines, since the ratio of grouped
lines to ungrouped is ~1:4.
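For reference, a minimal sketch of requesting the reducer count on the
cogroup statement itself with Pig's PARALLEL clause, rather than relying
on the cluster-wide reduce-task setting (the aliases are the ones from
the script further down; 7 just mirrors the current cluster configuration):

    linked = cogroup dsfilt by (linkid,itemid), isfilt by (linkid,itemid),
             cbfilt by (linkid,itemid), rbfilt by (linkid,itemid),
             fafilt by (linkid,itemid) parallel 7;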

Thanks,
Chuck

-----Original Message-----
From: Alan Gates [mailto:gates@yahoo-inc.com] 
Sent: Friday, September 19, 2008 10:15 AM
To: pig-user@incubator.apache.org
Subject: Re: How are GROUP and COGROUP functions implemented?

Sorry for the slow reply.

You're doing a four way join, so it's not too surprising that it's 
taking 9 minutes to materialize that join.  One thing you could try is 
adding a parallel statement to your cogroup.  By default pig tells 
hadoop to select the parallelism of the reducer, and hadoop by default 
picks 1.  So if you put parallel X, where X is 2 times the number of 
machines you have, you should get better performance.

Alan.

Chuck Lan wrote:
> Just noticed that there is no attachment.  Here's the script, inline.
>
> register userfunc.jar
> items = load 'chuck.item' using PigStorage(',') as (itemid,upp,wacplamt,parentid);
> ds = load 'chuck.ds.qt' using PigStorage(',') as (saleid, itemid, linkid, invamt, invqty, custid, docid, invdatedim, subclosedatedim);
> dsfilt = filter ds by invdatedim >= 5480 and invdatedim <= 5569;
> is = load 'chuck.is.qt' using PigStorage(',') as (saleid, itemid, linkid, contractamt, invqty, custid, docid, invdatedim, subclosedatedim);
> isfilt = filter is by invdatedim >= 5480 and invdatedim <= 5569;
> cb = load 'chuck.cb.qt' using PigStorage(',') as (saleid, itemid, linkid, paidcbamt, distrcostamt, invqty, custid, docid, invdatedim, subclosedatedim);
> cbfilt = filter cb by invdatedim >= 5480 and invdatedim <= 5569;
> rb = load 'chuck.rb.qt' using PigStorage(',') as (saleid, itemid, linkid, rbpaymentamt, wacamt, rbpaymenttype, invqty, custid, docid, invdatedim, subclosedatedim);
> rbfilt = filter rb by invdatedim >= 5480 and invdatedim <= 5569;
> fa = load 'chuck.fa.qt' using PigStorage(',') as (saleid, itemid, linkid, faamt, invqty, custid, docid, invdatedim, subclosedatedim);
> fafilt = filter fa by invdatedim >= 5480 and invdatedim <= 5569;
> linked = cogroup dsfilt by (linkid,itemid), isfilt by (linkid,itemid), cbfilt by (linkid,itemid), rbfilt by (linkid,itemid), fafilt by (linkid,itemid);
> linked2 = foreach linked generate flatten(group), SUM(dsfilt.invamt), SUM(isfilt.contractamt), SUM(cbfilt.paidcbamt), SUM(cbfilt.distrcostamt), SUM(rbfilt.rbpaymentamt), SUM(rbfilt.wacamt), SUM(fafilt.faamt), SUM(dsfilt.invqty), SUM(isfilt.invqty), SUM(cbfilt.invqty), SUM(rbfilt.invqty), SUM(fafilt.invqty);
> linked3 = join linked2 by $1, items by itemid;
> netpp = foreach linked3 generate $0, $17,
> ($2 != 0 ? ($9 != 0 ? ($2 - $6 - $8) / ($9 * $15) : 0) :
>         ($3 != 0 ? ($10 != 0 ? ($3 - $6 - $8) / ($10 * $15) : 0) :
>                 ($4 != 0 ? ($11 != 0 ? (($16 * 0.8) / $15) - (($6 + $4 + $8) / $15) : 0) :
>                         ($6 != 0 ? ($12 != 0 ? (($16 * 0.8) / $15) - ($6 / $15) : 0) :
>                                 ($8 != 0 ? ($13 != 0 ? ($16 - $8) / $15 : 0) : 0)
>                         )
>                 )
>         )
> );
> bp9g = group netpp by $1;
> bp9 = foreach bp9g generate flatten(group), MIN(netpp.$2);
> bpj = cogroup items by parentid, bp9 by $0;
> bpjflatten = foreach bpj generate flatten(((COUNT(items) == 0) ? IdentityTuple('0') : items.itemid)), flatten(((COUNT(bp9) == 0) ? IdentityTuple('0') : bp9.$1));
> store bpjflatten into 'qbp.txt' using PigStorage(',');
>
> Thanks,
> Chuck
>
> -----Original Message-----
> From: Chuck Lan [mailto:clan@modeln.com]
> Sent: Thu 9/4/2008 5:47 PM
> To: pig-user@incubator.apache.org
> Subject: RE: How are GROUP and COGROUP functions implemented?
>  
> Sorry for the confusion.  I'm just using the hadoop ui convention in 
> the jobtracker.  I guess they had to do it this way since the ui is 
> re-used between the mapper and the reducer tasks.
>
> Attached is the script.  It took about 15 minutes to complete, with 
> the majority of the time (~9 minutes) spent on the first reduce.
>
> chuck.ds.qt 561803 lines
> chuck.is.qt 1973303 lines
> chuck.cb.qt 1973303 lines
> chuck.rb.qt 2756325 lines
> chuck.fa.qt 29327 lines
>
> Thanks,
> Chuck
>
>
>
> -----Original Message-----
> From: Alan Gates [mailto:gates@yahoo-inc.com]
> Sent: Thu 9/4/2008 4:51 PM
> To: pig-user@incubator.apache.org
> Subject: Re: How are GROUP and COGROUP functions implemented?
>  
> If you give us the sizes of your data, a copy of your script, and the time 
> it took to run, we can tell you if this matches with our experience.  I 
> don't follow what you mean when you say it is taking a lot of time 
> performing reduce->reduce.  On the web ui for your hadoop instance you 
> should be able to see how much time hadoop is taking in the sort, 
> shuffle, and reduce phases.  If a long time is being taken in the reduce 
> itself, that is pig and not hadoop.
>
> Alan.
>
> Chuck Lan wrote:
>   
>> Thanks.  That's what I thought it was doing.  I wrote a Pig script to do 
>> some simple calculation based on some number of CSV files running on six 
>> nodes.  It took longer than I had hoped for the calculation to complete 
>> for six nodes.  Maybe I just need better disks, since it seems to be 
>> taking a lot of time performing reduce->reduce (vs. reduce->copy and 
>> reduce->sort).
>>
>> Do you know of any Hadoop issues around performance?
>>
>> Thanks,
>> Chuck
>>  
>> -----Original Message-----
>> From: Alan Gates [mailto:gates@yahoo-inc.com] 
>> Sent: Wednesday, September 03, 2008 2:23 PM
>> To: pig-user@incubator.apache.org
>> Subject: Re: How are GROUP and COGROUP functions implemented?
>>
>> Map Reduce is written around the idea of doing project or filter (map 
>> phase), then grouping (sort and shuffle) and then applying any other 
>> operations (such as aggregate functions, etc.) (reduce).  Since pig is 
>> currently implemented on top of hadoop it makes use of hadoop's map 
>> reduce implementation to do its (co)grouping.  The efficiency is that 
>> of a parallel sort.
>>
>> Alan.
>>
>> Chuck Lan wrote:
>>   
>>     
>>> Hi,
>>>
>>>  
>>>
>>> How are the GROUP and COGROUP functions implemented?  What's their
>>> efficiency?
>>>
>>>  
>>>
>>> Thanks,
>>>
>>> Chuck
>>>
>>>
>>>   
>>>     
>>>       
>
>
>
>
>   

Re: How are GROUP and COGROUP functions implemented?

Posted by Alan Gates <ga...@yahoo-inc.com>.
Sorry for the slow reply.

You're doing a four way join, so it's not too surprising that it's 
taking 9 minutes to materialize that join.  One thing you could try is 
adding a parallel statement to your cogroup.  By default pig tells 
hadoop to select the parallelism of the reducer, and hadoop by default 
picks 1.  So if you put parallel X, where X is 2 times the number of 
machines you have, you should get better performance.
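As a rough sketch of what that looks like (made-up aliases; with, say, 6 
machines, X would be 12):

    grouped = cogroup a by key, b by key parallel 12;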

Alan.

Chuck Lan wrote:
> Just noticed that there is no attachment.  Here's the script, inline.
>
> register userfunc.jar
> items = load 'chuck.item' using PigStorage(',') as (itemid,upp,wacplamt,parentid);
> ds = load 'chuck.ds.qt' using PigStorage(',') as (saleid, itemid, linkid, invamt, invqty, custid, docid, invdatedim, subclosedatedim);
> dsfilt = filter ds by invdatedim >= 5480 and invdatedim <= 5569;
> is = load 'chuck.is.qt' using PigStorage(',') as (saleid, itemid, linkid, contractamt, invqty, custid, docid, invdatedim, subclosedatedim);
> isfilt = filter is by invdatedim >= 5480 and invdatedim <= 5569;
> cb = load 'chuck.cb.qt' using PigStorage(',') as (saleid, itemid, linkid, paidcbamt, distrcostamt, invqty, custid, docid, invdatedim, subclosedatedim);
> cbfilt = filter cb by invdatedim >= 5480 and invdatedim <= 5569;
> rb = load 'chuck.rb.qt' using PigStorage(',') as (saleid, itemid, linkid, rbpaymentamt, wacamt, rbpaymenttype, invqty, custid, docid, invdatedim, subclosedatedim);
> rbfilt = filter rb by invdatedim >= 5480 and invdatedim <= 5569;
> fa = load 'chuck.fa.qt' using PigStorage(',') as (saleid, itemid, linkid, faamt, invqty, custid, docid, invdatedim, subclosedatedim);
> fafilt = filter fa by invdatedim >= 5480 and invdatedim <= 5569;
> linked = cogroup dsfilt by (linkid,itemid), isfilt by (linkid,itemid), cbfilt by (linkid,itemid), rbfilt by (linkid,itemid), fafilt by (linkid,itemid);
> linked2 = foreach linked generate flatten(group), SUM(dsfilt.invamt), SUM(isfilt.contractamt), SUM(cbfilt.paidcbamt), SUM(cbfilt.distrcostamt), SUM(rbfilt.rbpaymentamt), SUM(rbfilt.wacamt), SUM(fafilt.faamt), SUM(dsfilt.invqty), SUM(isfilt.invqty), SUM(cbfilt.invqty), SUM(rbfilt.invqty), SUM(fafilt.invqty);
> linked3 = join linked2 by $1, items by itemid;
> netpp = foreach linked3 generate $0, $17,
> ($2 != 0 ? ($9 != 0 ? ($2 - $6 - $8) / ($9 * $15) : 0) :
>         ($3 != 0 ? ($10 != 0 ? ($3 - $6 - $8) / ($10 * $15) : 0) :
>                 ($4 != 0 ? ($11 != 0 ? (($16 * 0.8) / $15) - (($6 + $4 + $8) / $15) : 0) :
>                         ($6 != 0 ? ($12 != 0 ? (($16 * 0.8) / $15) - ($6 / $15) : 0) :
>                                 ($8 != 0 ? ($13 != 0 ? ($16 - $8) / $15 : 0) : 0)
>                         )
>                 )
>         )
> );
> bp9g = group netpp by $1;
> bp9 = foreach bp9g generate flatten(group), MIN(netpp.$2);
> bpj = cogroup items by parentid, bp9 by $0;
> bpjflatten = foreach bpj generate flatten(((COUNT(items) == 0) ? IdentityTuple('0') : items.itemid)), flatten(((COUNT(bp9) == 0) ? IdentityTuple('0') : bp9.$1));
> store bpjflatten into 'qbp.txt' using PigStorage(',');
>
> Thanks,
> Chuck
>
> -----Original Message-----
> From: Chuck Lan [mailto:clan@modeln.com]
> Sent: Thu 9/4/2008 5:47 PM
> To: pig-user@incubator.apache.org
> Subject: RE: How are GROUP and COGROUP functions implemented?
>  
> Sorry for the confusion.  I'm just using the hadoop ui convention in the jobtracker.  I guess they had to do it this way since the ui is re-used between the mapper and the reducer tasks.
>
> Attached is the script.  It took about 15 minutes to complete, with the majority of the time (~9 minutes) spent on the first reduce.
>
> chuck.ds.qt 561803 lines
> chuck.is.qt 1973303 lines
> chuck.cb.qt 1973303 lines
> chuck.rb.qt 2756325 lines
> chuck.fa.qt 29327 lines
>
> Thanks,
> Chuck
>
>
>
> -----Original Message-----
> From: Alan Gates [mailto:gates@yahoo-inc.com]
> Sent: Thu 9/4/2008 4:51 PM
> To: pig-user@incubator.apache.org
> Subject: Re: How are GROUP and COGROUP functions implemented?
>  
> If you give us the sizes of your data, a copy of your script, and the time 
> it took to run, we can tell you if this matches with our experience.  I 
> don't follow what you mean when you say it is taking a lot of time 
> performing reduce->reduce.  On the web ui for your hadoop instance you 
> should be able to see how much time hadoop is taking in the sort, 
> shuffle, and reduce phases.  If a long time is being taken in the reduce 
> itself, that is pig and not hadoop.
>
> Alan.
>
> Chuck Lan wrote:
>   
>> Thanks.  That's what I thought it was doing.  I wrote a Pig script to do
>> some simple calculation based on some number of CSV files running on six
>> nodes.  It took longer than I had hoped for the calculation to complete
>> for six nodes.  Maybe I just need better disks, since it seems to be
>> taking a lot of time performing reduce->reduce (vs. reduce->copy and
>> reduce->sort).
>>
>> Do you know of any Hadoop issues around performance?
>>
>> Thanks,
>> Chuck
>>  
>> -----Original Message-----
>> From: Alan Gates [mailto:gates@yahoo-inc.com] 
>> Sent: Wednesday, September 03, 2008 2:23 PM
>> To: pig-user@incubator.apache.org
>> Subject: Re: How are GROUP and COGROUP functions implemented?
>>
>> Map Reduce is written around the idea of doing project or filter (map 
>> phase), then grouping (sort and shuffle) and then applying any other 
>> operations (such as aggregate functions, etc.) (reduce).  Since pig is 
>> currently implemented on top of hadoop it makes use of hadoop's map 
>> reduce implementation to do its (co)grouping.  The efficiency is that 
>> of a parallel sort.
>>
>> Alan.
>>
>> Chuck Lan wrote:
>>   
>>     
>>> Hi,
>>>
>>>  
>>>
>>> How are the GROUP and COGROUP functions implemented?  What's their
>>> efficiency?
>>>
>>>  
>>>
>>> Thanks,
>>>
>>> Chuck
>>>
>>>
>>>   
>>>     
>>>       
>
>
>
>
>   

RE: How are GROUP and COGROUP functions implemented?

Posted by Chuck Lan <cl...@modeln.com>.
Just noticed that there is no attachment.  Here's the script, inline.

register userfunc.jar
items = load 'chuck.item' using PigStorage(',') as (itemid,upp,wacplamt,parentid);
ds = load 'chuck.ds.qt' using PigStorage(',') as (saleid, itemid, linkid, invamt, invqty, custid, docid, invdatedim, subclosedatedim);
dsfilt = filter ds by invdatedim >= 5480 and invdatedim <= 5569;
is = load 'chuck.is.qt' using PigStorage(',') as (saleid, itemid, linkid, contractamt, invqty, custid, docid, invdatedim, subclosedatedim);
isfilt = filter is by invdatedim >= 5480 and invdatedim <= 5569;
cb = load 'chuck.cb.qt' using PigStorage(',') as (saleid, itemid, linkid, paidcbamt, distrcostamt, invqty, custid, docid, invdatedim, subclosedatedim);
cbfilt = filter cb by invdatedim >= 5480 and invdatedim <= 5569;
rb = load 'chuck.rb.qt' using PigStorage(',') as (saleid, itemid, linkid, rbpaymentamt, wacamt, rbpaymenttype, invqty, custid, docid, invdatedim, subclosedatedim);
rbfilt = filter rb by invdatedim >= 5480 and invdatedim <= 5569;
fa = load 'chuck.fa.qt' using PigStorage(',') as (saleid, itemid, linkid, faamt, invqty, custid, docid, invdatedim, subclosedatedim);
fafilt = filter fa by invdatedim >= 5480 and invdatedim <= 5569;
linked = cogroup dsfilt by (linkid,itemid), isfilt by (linkid,itemid), cbfilt by (linkid,itemid), rbfilt by (linkid,itemid), fafilt by (linkid,itemid);
linked2 = foreach linked generate flatten(group), SUM(dsfilt.invamt), SUM(isfilt.contractamt), SUM(cbfilt.paidcbamt), SUM(cbfilt.distrcostamt), SUM(rbfilt.rbpaymentamt), SUM(rbfilt.wacamt), SUM(fafilt.faamt), SUM(dsfilt.invqty), SUM(isfilt.invqty), SUM(cbfilt.invqty), SUM(rbfilt.invqty), SUM(fafilt.invqty);
linked3 = join linked2 by $1, items by itemid;
netpp = foreach linked3 generate $0, $17,
($2 != 0 ? ($9 != 0 ? ($2 - $6 - $8) / ($9 * $15) : 0) :
        ($3 != 0 ? ($10 != 0 ? ($3 - $6 - $8) / ($10 * $15) : 0) :
                ($4 != 0 ? ($11 != 0 ? (($16 * 0.8) / $15) - (($6 + $4 + $8) / $15) : 0) :
                        ($6 != 0 ? ($12 != 0 ? (($16 * 0.8) / $15) - ($6 / $15) : 0) :
                                ($8 != 0 ? ($13 != 0 ? ($16 - $8) / $15 : 0) : 0)
                        )
                )
        )
);
bp9g = group netpp by $1;
bp9 = foreach bp9g generate flatten(group), MIN(netpp.$2);
bpj = cogroup items by parentid, bp9 by $0;
bpjflatten = foreach bpj generate flatten(((COUNT(items) == 0) ? IdentityTuple('0') : items.itemid)), flatten(((COUNT(bp9) == 0) ? IdentityTuple('0') : bp9.$1));
store bpjflatten into 'qbp.txt' using PigStorage(',');

Thanks,
Chuck

-----Original Message-----
From: Chuck Lan [mailto:clan@modeln.com]
Sent: Thu 9/4/2008 5:47 PM
To: pig-user@incubator.apache.org
Subject: RE: How are GROUP and COGROUP functions implemented?
 
Sorry for the confusion.  I'm just using the hadoop ui convention in the jobtracker.  I guess they had to do it this way since the ui is re-used between the mapper and the reducer tasks.

Attached is the script.  It took about 15 minutes to complete, with the majority of the time (~9 minutes) spent on the first reduce.

chuck.ds.qt 561803 lines
chuck.is.qt 1973303 lines
chuck.cb.qt 1973303 lines
chuck.rb.qt 2756325 lines
chuck.fa.qt 29327 lines

Thanks,
Chuck



-----Original Message-----
From: Alan Gates [mailto:gates@yahoo-inc.com]
Sent: Thu 9/4/2008 4:51 PM
To: pig-user@incubator.apache.org
Subject: Re: How are GROUP and COGROUP functions implemented?
 
If you give us the sizes of your data, a copy of your script, and the time 
it took to run, we can tell you if this matches with our experience.  I 
don't follow what you mean when you say it is taking a lot of time 
performing reduce->reduce.  On the web ui for your hadoop instance you 
should be able to see how much time hadoop is taking in the sort, 
shuffle, and reduce phases.  If a long time is being taken in the reduce 
itself, that is pig and not hadoop.

Alan.

Chuck Lan wrote:
> Thanks.  That's what I thought it was doing.  I wrote a Pig script to do
> some simple calculation based on some number of CSV files running on six
> nodes.  It took longer than I had hoped for the calculation to complete
> for six nodes.  Maybe I just need better disks, since it seems to be
> taking a lot of time performing reduce->reduce (vs. reduce->copy and
> reduce->sort).
>
> Do you know of any Hadoop issues around performance?
>
> Thanks,
> Chuck
>  
> -----Original Message-----
> From: Alan Gates [mailto:gates@yahoo-inc.com] 
> Sent: Wednesday, September 03, 2008 2:23 PM
> To: pig-user@incubator.apache.org
> Subject: Re: How are GROUP and COGROUP functions implemented?
>
> Map Reduce is written around the idea of doing project or filter (map 
> phase), then grouping (sort and shuffle) and then applying any other 
> operations (such as aggregate functions, etc.) (reduce).  Since pig is 
> currently implemented on top of hadoop it makes use of hadoop's map 
> reduce implementation to do its (co)grouping.  The efficiency is that 
> of a parallel sort.
>
> Alan.
>
> Chuck Lan wrote:
>   
>> Hi,
>>
>>  
>>
>> How are the GROUP and COGROUP functions implemented?  What's their
>> efficiency?
>>
>>  
>>
>> Thanks,
>>
>> Chuck
>>
>>
>>   
>>     




RE: How are GROUP and COGROUP functions implemented?

Posted by Chuck Lan <cl...@modeln.com>.
Sorry for the confusion.  I'm just using the hadoop ui convention in the jobtracker.  I guess they had to do it this way since the ui is re-used between the mapper and the reducer tasks.

Attached is the script.  It took about 15 minutes to complete, with the majority of the time (~9 minutes) spent on the first reduce.

chuck.ds.qt 561803 lines
chuck.is.qt 1973303 lines
chuck.cb.qt 1973303 lines
chuck.rb.qt 2756325 lines
chuck.fa.qt 29327 lines

Thanks,
Chuck



-----Original Message-----
From: Alan Gates [mailto:gates@yahoo-inc.com]
Sent: Thu 9/4/2008 4:51 PM
To: pig-user@incubator.apache.org
Subject: Re: How are GROUP and COGROUP functions implemented?
 
If you give us the sizes of your data, a copy of your script, and the time 
it took to run, we can tell you if this matches with our experience.  I 
don't follow what you mean when you say it is taking a lot of time 
performing reduce->reduce.  On the web ui for your hadoop instance you 
should be able to see how much time hadoop is taking in the sort, 
shuffle, and reduce phases.  If a long time is being taken in the reduce 
itself, that is pig and not hadoop.

Alan.

Chuck Lan wrote:
> Thanks.  That's what I thought it was doing.  I wrote a Pig script to do
> some simple calculation based on some number of CSV files running on six
> nodes.  It took longer than I had hoped for the calculation to complete
> for six nodes.  Maybe I just need better disks, since it seems to be
> taking a lot of time performing reduce->reduce (vs. reduce->copy and
> reduce->sort).
>
> Do you know of any Hadoop issues around performance?
>
> Thanks,
> Chuck
>  
> -----Original Message-----
> From: Alan Gates [mailto:gates@yahoo-inc.com] 
> Sent: Wednesday, September 03, 2008 2:23 PM
> To: pig-user@incubator.apache.org
> Subject: Re: How are GROUP and COGROUP functions implemented?
>
> Map Reduce is written around the idea of doing project or filter (map 
> phase), then grouping (sort and shuffle) and then applying any other 
> operations (such as aggregate functions, etc.) (reduce).  Since pig is 
> currently implemented on top of hadoop it makes use of hadoop's map 
> reduce implementation to do its (co)grouping.  The efficiency is that 
> of a parallel sort.
>
> Alan.
>
> Chuck Lan wrote:
>   
>> Hi,
>>
>>  
>>
>> How are the GROUP and COGROUP functions implemented?  What's their
>> efficiency?
>>
>>  
>>
>> Thanks,
>>
>> Chuck
>>
>>
>>   
>>     



Re: How are GROUP and COGROUP functions implemented?

Posted by Alan Gates <ga...@yahoo-inc.com>.
If you give us the sizes of your data, a copy of your script, and the time 
it took to run, we can tell you if this matches with our experience.  I 
don't follow what you mean when you say it is taking a lot of time 
performing reduce->reduce.  On the web ui for your hadoop instance you 
should be able to see how much time hadoop is taking in the sort, 
shuffle, and reduce phases.  If a long time is being taken in the reduce 
itself, that is pig and not hadoop.

Alan.

Chuck Lan wrote:
> Thanks.  That's what I thought it was doing.  I wrote a Pig script to do
> some simple calculation based on some number of CSV files running on six
> nodes.  It took longer than I had hoped for the calculation to complete
> for six nodes.  Maybe I just need better disks, since it seems to be
> taking a lot of time performing reduce->reduce (vs. reduce->copy and
> reduce->sort).
>
> Do you know of any Hadoop issues around performance?
>
> Thanks,
> Chuck
>  
> -----Original Message-----
> From: Alan Gates [mailto:gates@yahoo-inc.com] 
> Sent: Wednesday, September 03, 2008 2:23 PM
> To: pig-user@incubator.apache.org
> Subject: Re: How are GROUP and COGROUP functions implemented?
>
> Map Reduce is written around the idea of doing project or filter (map 
> phase), then grouping (sort and shuffle) and then applying any other 
> operations (such as aggregate functions, etc.) (reduce).  Since pig is 
> currently implemented on top of hadoop it makes use of hadoop's map 
> reduce implementation to do its (co)grouping.  The efficiency is that 
> of a parallel sort.
>
> Alan.
>
> Chuck Lan wrote:
>   
>> Hi,
>>
>>  
>>
>> How are the GROUP and COGROUP functions implemented?  What's their
>> efficiency?
>>
>>  
>>
>> Thanks,
>>
>> Chuck
>>
>>
>>   
>>     

RE: How are GROUP and COGROUP functions implemented?

Posted by Chuck Lan <cl...@modeln.com>.
Thanks.  That's what I thought it was doing.  I wrote a Pig script to do
some simple calculation based on some number of CSV files running on six
nodes.  It took longer than I had hoped for the calculation to complete
for six nodes.  Maybe I just need better disks, since it seems to be
taking a lot of time performing reduce->reduce (vs. reduce->copy and
reduce->sort).

Do you know of any Hadoop issues around performance?

Thanks,
Chuck
 
-----Original Message-----
From: Alan Gates [mailto:gates@yahoo-inc.com] 
Sent: Wednesday, September 03, 2008 2:23 PM
To: pig-user@incubator.apache.org
Subject: Re: How are GROUP and COGROUP functions implemented?

Map Reduce is written around the idea of doing project or filter (map 
phase), then grouping (sort and shuffle) and then applying any other 
operations (such as aggregate functions, etc.) (reduce).  Since pig is 
currently implemented on top of hadoop it makes use of hadoop's map 
> reduce implementation to do its (co)grouping.  The efficiency is that 
of a parallel sort.

Alan.

Chuck Lan wrote:
> Hi,
>
>  
>
> How are the GROUP and COGROUP functions implemented?  What's their
> efficiency?
>
>  
>
> Thanks,
>
> Chuck
>
>
>   

Re: How are GROUP and COGROUP functions implemented?

Posted by Alan Gates <ga...@yahoo-inc.com>.
Map Reduce is written around the idea of doing project or filter (map 
phase), then grouping (sort and shuffle) and then applying any other 
operations (such as aggregate functions, etc.) (reduce).  Since pig is 
currently implemented on top of hadoop it makes use of hadoop's map 
reduce implementation to do its (co)grouping.  The efficiency is that 
of a parallel sort.
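As a rough sketch of how that plays out for a simple group-and-aggregate 
(the file and field names here are made up): the load and projection run in 
the map phase, the group key drives hadoop's sort and shuffle, and the 
aggregate is applied in the reduce:

    raw     = load 'sales.csv' using PigStorage(',') as (itemid, amt);
    grouped = group raw by itemid;                           -- grouping done by the sort/shuffle
    totals  = foreach grouped generate group, SUM(raw.amt);  -- runs in the reduce
    store totals into 'totals.out' using PigStorage(',');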

Alan.

Chuck Lan wrote:
> Hi,
>
>  
>
> How are the GROUP and COGROUP functions implemented?  What's their
> efficiency?
>
>  
>
> Thanks,
>
> Chuck
>
>
>