Posted to user@pig.apache.org by Charles Gonçalves <ch...@gmail.com> on 2011/02/11 17:57:51 UTC
PARALLEL INSIDE a nested foreach block / DEFAULT_PARALLEL not working?!
Is it possible to use a PARALLEL statement inside a nested FOREACH block,
like this:

28   E = GROUP B ALL PARALLEL 100;
29
30   edge_breakdown = FOREACH E {
31       dist_cIps = DISTINCT B.cIp PARALLEL X;
32       dist_sIps = DISTINCT B.sIp;
33       urls_ok   = FILTER B BY valid(url);
34       GENERATE COUNT(dist_cIps), COUNT(dist_sIps), COUNT(urls_ok.url),
             COUNT(B.url), SUM(B.scBytes);
35   }
I got this error:

    ERROR 1000: Error during parsing. Encountered " "parallel" "PARALLEL "" at
    line 36, column 36.
    Was expecting:
        ";" ...
My problem is that I'm using PARALLEL on line 28 and also setting

14   SET DEFAULT_PARALLEL 30;

But even so I'm getting just one reducer!!

Is there some optimization that I can disable?
I already tried playing with pig.exec.reducers.bytes.per.reducer, and
nothing.
I'm processing 2 TB of data, and one reducer is yielding a "no space left
on device" error!
Any
--
*Charles Ferreira Gonçalves *
http://homepages.dcc.ufmg.br/~charles/
UFMG - ICEx - Dcc
Cel.: 55 31 87741485
Tel.: 55 31 34741485
Lab.: 55 31 34095840
Re: PARALLEL INSIDE a nested foreach block / DEFAULT_PARALLEL not working?!
Posted by Dmitriy Ryaboy <dv...@gmail.com>.
"Group all" puts everything into one group, so kinda hard to find
anything for the other 99 reducers :)
Which is ok if you are applying algebraic functions to it, like
counting or finding maxes of things, as those operations will be
pushed out to the mappers instead of building the group on a single
reducer.
D
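Dmitriy's point can be sketched in Pig Latin (relation and column names
assumed from the script above): algebraic functions let the map-side
combiner do most of the work, so the single GROUP ALL reducer only has to
merge small partial results.

```pig
-- Sketch: COUNT and SUM are algebraic, so partial counts/sums are computed
-- in the map-side combiner and the lone GROUP ALL reducer just merges them.
grp    = GROUP B ALL;
totals = FOREACH grp GENERATE COUNT(B.url) AS urls, SUM(B.scBytes) AS bytes;
```

A nested DISTINCT, by contrast, is not algebraic, so the full input has to
be shipped to that one reducer.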
On Fri, Feb 11, 2011 at 10:10 AM, Charles Gonçalves
<ch...@gmail.com> wrote:
> Yes, but even using E = GROUP B ALL PARALLEL 100;
> I got only one reducer (and obviously
> <http://www.youtube.com/watch?v=hMtZfW2z9dw> no space to process everything).
>
> I tried GROUP BY something and it worked.
> Could it be some optimization issue?!
>
>
> On Fri, Feb 11, 2011 at 3:10 PM, Alan Gates <ga...@yahoo-inc.com> wrote:
>
>> Possible, but it will be ignored. Anything done inside a nested foreach
>> block will be executed at the parallel level of the preceding group by.
>>
>> Alan.
>>
>> On Feb 11, 2011, at 8:57 AM, Charles Gonçalves wrote:
>> [...]
Re: PARALLEL INSIDE a nested foreach block / DEFAULT_PARALLEL not working?!
Posted by Alan Gates <ga...@yahoo-inc.com>.
When you say GROUP ALL, it has to set PARALLEL to 1, because you're
telling it to collect everything together.
Alan.
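One common workaround (a sketch only; relation and column names assumed
from the original script) is to run the expensive DISTINCT as its own
parallel step and apply GROUP ALL only to the already-small distinct set:

```pig
-- Do the heavy deduplication with full parallelism first...
cips      = FOREACH B GENERATE cIp;
dist_cips = DISTINCT cips PARALLEL 100;    -- 100 reducers share the work
-- ...then GROUP ALL sees only the small distinct set, so 1 reducer is fine.
grp       = GROUP dist_cips ALL;
cip_count = FOREACH grp GENERATE COUNT(dist_cips);
```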
On Feb 11, 2011, at 10:10 AM, Charles Gonçalves wrote:
> Yes, but even using E = GROUP B ALL PARALLEL 100;
> I got only one reducer (and obviously
> <http://www.youtube.com/watch?v=hMtZfW2z9dw> no space to process everything).
>
> I tried GROUP BY something and it worked.
> Could it be some optimization issue?!
>
>
> On Fri, Feb 11, 2011 at 3:10 PM, Alan Gates <ga...@yahoo-inc.com> wrote:
> [...]
Re: PARALLEL INSIDE a nested foreach block / DEFAULT_PARALLEL not working?!
Posted by Charles Gonçalves <ch...@gmail.com>.
Yes, but even using E = GROUP B ALL PARALLEL 100;
I got only one reducer (and obviously
<http://www.youtube.com/watch?v=hMtZfW2z9dw> no space to process everything).

I tried GROUP BY something and it worked.
Could it be some optimization issue?!
On Fri, Feb 11, 2011 at 3:10 PM, Alan Gates <ga...@yahoo-inc.com> wrote:
> Possible, but it will be ignored. Anything done inside a nested foreach
> block will be executed at the parallel level of the preceding group by.
>
> Alan.
>
>
> On Feb 11, 2011, at 8:57 AM, Charles Gonçalves wrote:
>
> [...]
Re: PARALLEL INSIDE a nested foreach block / DEFAULT_PARALLEL not working?!
Posted by Alan Gates <ga...@yahoo-inc.com>.
Possible, but it will be ignored. Anything done inside a nested
foreach block will be executed at the parallel level of the preceding
group by.
Alan.
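In other words, the only place parallelism can be set for the nested
operators is on the GROUP itself. A sketch (grouping key and column names
assumed from the original script):

```pig
-- PARALLEL is legal on the GROUP; everything inside the nested FOREACH
-- then runs at that same 100-way parallelism.
E = GROUP B BY cIp PARALLEL 100;
per_cip = FOREACH E {
    dist_sIps = DISTINCT B.sIp;    -- no PARALLEL allowed here
    GENERATE group, COUNT(dist_sIps);
}
```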
On Feb 11, 2011, at 8:57 AM, Charles Gonçalves wrote:
> [...]