Posted to user@pig.apache.org by Charles Gonçalves <ch...@gmail.com> on 2011/02/11 17:57:51 UTC

PARALLEL INSIDE a nested foreach block / DEFAULT_PARALLEL not working?!

Is it possible to use a PARALLEL clause inside a nested FOREACH block, like this:

 28 E = GROUP B ALL PARALLEL 100;
 29
 30 edge_breakdown = FOREACH E {
 31   dist_cIps = DISTINCT B.cIp PARALLEL X;
 32   dist_sIps = DISTINCT B.sIp;
 33   urls_ok = FILTER B BY valid(url);
 34   GENERATE COUNT(dist_cIps), COUNT(dist_sIps), COUNT(urls_ok.url),
      COUNT(B.url), SUM(B.scBytes);
 35 }

I got an error:
ERROR 1000: Error during parsing. Encountered " "parallel" "PARALLEL "" at
line 36, column 36.
Was expecting:
    ";" ...

My problem is that I'm using PARALLEL on line 28 and also setting, on line 14:

 14 SET DEFAULT_PARALLEL 30;

But even so, I'm getting just one reducer!

Is there some optimization that I can disable?
I already tried playing with pig.exec.reducers.bytes.per.reducer, and nothing changed.
I'm processing 2 TB of data, and one reducer is yielding a "no space left on
device" error!

Any


-- 
*Charles Ferreira Gonçalves *
http://homepages.dcc.ufmg.br/~charles/
UFMG - ICEx - Dcc
Cel.: 55 31 87741485
Tel.:  55 31 34741485
Lab.: 55 31 34095840

Re: PARALLEL INSIDE a nested foreach block / DEFAULT_PARALLEL not working?!

Posted by Dmitriy Ryaboy <dv...@gmail.com>.
"Group all" puts everything into one group, so it's kinda hard to find
anything for the other 99 reducers to do :)

Which is OK if you are applying algebraic functions to it, like
counting or finding maxes of things, since those operations get
pushed out to the mappers (via the combiner) instead of building the
group on a single reducer.
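
For the distinct counts, one common workaround (a sketch, reusing the
field names from your script) is to deduplicate at the top level, where
DISTINCT does honor PARALLEL, and only then collapse the small result:

```pig
-- Deduplicate client IPs in parallel first; a top-level DISTINCT
-- accepts PARALLEL, unlike one inside a nested FOREACH block.
cips       = FOREACH B GENERATE cIp;
dist_cips  = DISTINCT cips PARALLEL 100;

-- The GROUP ALL input is now tiny, so its single reducer is harmless.
g          = GROUP dist_cips ALL;
n_dist_cip = FOREACH g GENERATE COUNT(dist_cips);
```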

D

On Fri, Feb 11, 2011 at 10:10 AM, Charles Gonçalves
<ch...@gmail.com> wrote:
> Yes, but even using: E = GROUP B ALL PARALLEL 100;
> I got only one reducer (and obviously
> <http://www.youtube.com/watch?v=hMtZfW2z9dw> no space to process everything).
>
> I tried GROUP BY something and it worked.
> Could this be some optimization issue?
>
>
> On Fri, Feb 11, 2011 at 3:10 PM, Alan Gates <ga...@yahoo-inc.com> wrote:
>
>> Possible, but it will be ignored.  Anything done inside a nested foreach
>> block will be executed at the parallel level of the preceding group by.
>>
>> Alan.
>>

Re: PARALLEL INSIDE a nested foreach block / DEFAULT_PARALLEL not working?!

Posted by Alan Gates <ga...@yahoo-inc.com>.
When you say GROUP ALL, Pig has to set the parallelism to 1, because
you're telling it to collect everything together.
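
For example (a sketch using the relation from the original script):

```pig
-- GROUP ... ALL collapses every row into one group, so Pig runs
-- the reduce with a single reducer no matter what you request:
E = GROUP B ALL PARALLEL 100;     -- still 1 reducer

-- GROUP ... BY produces many groups, so PARALLEL takes effect:
G = GROUP B BY cIp PARALLEL 100;  -- 100 reducers
```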

Alan.

On Feb 11, 2011, at 10:10 AM, Charles Gonçalves wrote:

> Yes, but even using: E = GROUP B ALL PARALLEL 100;
> I got only one reducer (and obviously
> <http://www.youtube.com/watch?v=hMtZfW2z9dw> no space to process everything).
>
> I tried GROUP BY something and it worked.
> Could this be some optimization issue?
>


Re: PARALLEL INSIDE a nested foreach block / DEFAULT_PARALLEL not working?!

Posted by Charles Gonçalves <ch...@gmail.com>.
Yes, but even using: E = GROUP B ALL PARALLEL 100;
I got only one reducer (and obviously
<http://www.youtube.com/watch?v=hMtZfW2z9dw> no space to process everything).

I tried GROUP BY something and it worked.
Could this be some optimization issue?


On Fri, Feb 11, 2011 at 3:10 PM, Alan Gates <ga...@yahoo-inc.com> wrote:

> Possible, but it will be ignored.  Anything done inside a nested foreach
> block will be executed at the parallel level of the preceding group by.
>
> Alan.


-- 
*Charles Ferreira Gonçalves *
http://homepages.dcc.ufmg.br/~charles/
UFMG - ICEx - Dcc
Cel.: 55 31 87741485
Tel.:  55 31 34741485
Lab.: 55 31 34095840

Re: PARALLEL INSIDE a nested foreach block / DEFAULT_PARALLEL not working?!

Posted by Alan Gates <ga...@yahoo-inc.com>.
Possible, but it will be ignored.  Anything done inside a nested
foreach block will be executed at the parallel level of the preceding
GROUP BY.
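
In other words (a sketch, reusing the names from the original script),
PARALLEL belongs on the operator that triggers the reduce, not on the
statements inside the nested block:

```pig
-- PARALLEL is honored here, on the GROUP ...
E = GROUP B BY cIp PARALLEL 30;

-- ... and the nested block runs at that same parallelism.
edge_breakdown = FOREACH E {
    urls_ok = FILTER B BY valid(url);  -- no PARALLEL allowed in here
    GENERATE group, COUNT(urls_ok);
};
```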

Alan.
