Posted to user@pig.apache.org by Rob Stewart <ro...@googlemail.com> on 2010/01/20 10:58:18 UTC

WordCount Results Version 2 - Pig 0.6.0

Hi again,

The results have been produced. I can tell you that I made the following
improvements:
1. Removed unnecessary "words = FOREACH myinput GENERATE
FLATTEN(TOKENIZE($0));"
2. Using PigStorage, not the deprecated TextLoader
3. Using Pig 0.6.0, not 0.5.0


Here is the resulting Pig script, executed by pig 0.6.0:
--------------
myinput = LOAD 'Inputs/WordCount/wordsx1_skewed.dat' USING PigStorage();
grouped = GROUP myinput BY $0 PARALLEL 56;
counts = FOREACH grouped GENERATE group, COUNT(myinput) AS total;
STORE counts INTO 'Outputs/WordCount/wordCountx1_skewed.pig' USING PigStorage();
--------------
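
A footnote to improvement (1): dropping the TOKENIZE/FLATTEN step only works because the input file is assumed to hold one word per line. If each line could carry several words, the script would need something roughly like the sketch below (illustrative only, not part of the benchmark run):

--------------
-- Hypothetical multi-word-per-line variant: split each line into
-- individual words before grouping.
lines = LOAD 'Inputs/WordCount/wordsx1_skewed.dat' USING PigStorage();
words = FOREACH lines GENERATE FLATTEN(TOKENIZE($0)) AS word;
grouped = GROUP words BY word PARALLEL 56;
counts = FOREACH grouped GENERATE group, COUNT(words) AS total;
--------------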

The results can be found here:
www.macs.hw.ac.uk/~rs46/WordCount_ScaleUp_Results_version2.pdf

This brings Pig into the comparison, though Hive still appears to perform
best. Remember that this is just a GROUP and COUNT (for word count). I will
run the other workloads ("skewed JOIN" and "Group, Count and Order by"),
which may show off some of Pig's optimizations.
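
For reference, those follow-up runs would presumably rely on Pig constructs along the lines of the sketch below; the relation names (A, B), the join key, and the reuse of "counts" from the script above are placeholders rather than details of the actual benchmark:

--------------
-- Skewed join sketch: Pig samples the key distribution and spreads
-- heavily skewed keys across several reducers.
joined = JOIN A BY $0, B BY $0 USING 'skewed' PARALLEL 56;

-- Group, count and order sketch: the word-count pipeline above with
-- an extra ORDER step on the counts.
ordered = ORDER counts BY total DESC PARALLEL 56;
--------------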

Rob Stewart

Re: WordCount Results Version 2 - Pig 0.6.0

Posted by Alan Gates <ga...@yahoo-inc.com>.
Sorry, I misread Mridul's comment and your changes to the script. I thought
the parallel keyword went away, but I see it didn't. I am curious if you
played with different values of parallelism to see how they affected
performance.

Alan.
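
One low-effort way to run such a sweep, assuming parameter substitution (pig -param) is available in this Pig build, is to make the reducer count a script parameter; the script name and parameter name below are illustrative:

--------------
-- Sketch: reducer count supplied at launch time, e.g.
--   pig -param reducers=56 wordcount.pig
-- so runs with 14, 28, 56 or 112 reducers can be compared without
-- editing the script. The output path carries the same suffix to
-- keep the runs from colliding.
myinput = LOAD 'Inputs/WordCount/wordsx1_skewed.dat' USING PigStorage();
grouped = GROUP myinput BY $0 PARALLEL $reducers;
counts = FOREACH grouped GENERATE group, COUNT(myinput) AS total;
STORE counts INTO 'Outputs/WordCount/wordCountx1_skewed_$reducers.pig' USING PigStorage();
--------------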

On Jan 20, 2010, at 1:24 PM, Rob Stewart wrote:

> Hi Alan, I'm not quite sure what you mean. As shown in my Pig
> script, I have specified 56 reducers for the "group by" task. And the
> number of mappers is decided by Hadoop. Is there any way to optimize
> my Pig script further?
>
> On 20 Jan 2010 19:07, "Alan Gates" <ga...@yahoo-inc.com> wrote:
>
> Are you setting parallel as Mridul suggests?  Or does your cluster
> have a default parallelism set?
>
> Alan.
>
> On Jan 20, 2010, at 1:58 AM, Rob Stewart wrote:
>
> > Hi again,
> >
> > The results have been produced. I...


Re: WordCount Results Version 2 - Pig 0.6.0

Posted by Rob Stewart <ro...@googlemail.com>.
Hi Alan, I'm not quite sure what you mean. As shown in my Pig script, I have
specified 56 reducers for the "group by" task. And the number of mappers is
decided by Hadoop. Is there any way to optimize my Pig script further?

On 20 Jan 2010 19:07, "Alan Gates" <ga...@yahoo-inc.com> wrote:

Are you setting parallel as Mridul suggests?  Or does your cluster have a
default parallelism set?

Alan.

On Jan 20, 2010, at 1:58 AM, Rob Stewart wrote:

> Hi again,
>
> The results have been produced. I...

Re: WordCount Results Version 2 - Pig 0.6.0

Posted by Alan Gates <ga...@yahoo-inc.com>.
Are you setting parallel as Mridul suggests?  Or does your cluster
have a default parallelism set?

Alan.

On Jan 20, 2010, at 1:58 AM, Rob Stewart wrote:

> Hi again,
>
> The results have been produced. I can tell you that I made the  
> following
> improvements:
> 1. Removed unnecessary "words = FOREACH myinput GENERATE
> FLATTEN(TOKENIZE($0));"
> 2. Using PigStorage, not the deprecated TextLoader
> 3. Using Pig 0.6.0, not 0.5.0
>
>
> Here is the resulting Pig script, executed by pig 0.6.0:
> --------------
> myinput = LOAD 'Inputs/WordCount/wordsx1_skewed.dat' USING PigStorage();
> grouped = GROUP myinput BY $0 PARALLEL 56;
> counts = FOREACH grouped GENERATE group, COUNT(myinput) AS total;
> STORE counts INTO 'Outputs/WordCount/wordCountx1_skewed.pig' USING PigStorage();
> --------------
>
> The results can be found here:
> www.macs.hw.ac.uk/~rs46/WordCount_ScaleUp_Results_version2.pdf
>
> This brings Pig into the comparison, though Hive still appears to
> perform best. Remember that this is just a GROUP and COUNT (for word
> count). I will run the other workloads ("skewed JOIN" and "Group,
> Count and Order by"), which may show off some of Pig's optimizations.
>
> Rob Stewart