Posted to user@pig.apache.org by Cosmin Lehene <cl...@adobe.com> on 2008/05/21 16:39:35 UTC

Using PARALLEL clause for nested operations

Hi, 

I'm trying to use the PARALLEL clause inside a

FOREACH X {
    ...
    ... GENERATE ... PARALLEL N;
}

And it fails.

670  [main] ERROR org.apache.pig.tools.grunt.Grunt  - Encountered "PARALLEL"
at line 3, column 64.
Was expecting one of:
    ";" ...
    "," ...
    ":" ...

Is it limited to regular statements?

Thanks,
Cosmin 


RE: Using PARALLEL clause for nested operations

Posted by Santhosh Srinivasan <sm...@yahoo-inc.com>.
It's a syntactically correct statement in QueryParser. However, the Pig
statement is shipped to QueryParser via Grunt, which does not expect the
PARALLEL keyword. There are two parsers you are dealing with:

1. Grunt parser that ships pig statements to QueryParser
2. QueryParser that parses the pig statements

The problem in this specific case is with the former and not the latter.

Santhosh 

-----Original Message-----
From: pi song [mailto:pi.songs@gmail.com] 
Sent: Wednesday, May 21, 2008 7:59 AM
To: pig-user@incubator.apache.org
Subject: Re: Using PARALLEL clause for nested operations

Of course, this is an open source project. Everybody can see the source
code.

Have a look at
http://svn.apache.org/viewvc/incubator/pig/trunk/src/org/apache/pig/impl/logicalLayer/parser/QueryParser.jjt?view=log

And here is the root source location:
http://svn.apache.org/viewvc/incubator/pig/trunk

It's a JavaCC file so a bit difficult to read. No good debugging tool
either.

Pi



Re: Using PARALLEL clause for nested operations

Posted by Cosmin Lehene <cl...@adobe.com>.
Just figured out that QueryParser.java is generated... I got fooled by the
stack trace...

That's a strange idiom in QueryParser.jjt :)

However, is it normal to get a single reduce job if you don't specify a
PARALLEL clause?

Cosmin

On 5/21/08 5:58 PM, "pi song" <pi...@gmail.com> wrote:

> Of course, this is an open source project. Everybody can see the source
> code.
> 
> Have a look at
> http://svn.apache.org/viewvc/incubator/pig/trunk/src/org/apache/pig/impl/logicalLayer/parser/QueryParser.jjt?view=log
> 
> And here is the root source location:
> http://svn.apache.org/viewvc/incubator/pig/trunk
> 
> It's a JavaCC file so a bit difficult to read. No good debugging tool
> either.
> 
> Pi
> 
> 


Re: Using PARALLEL clause for nested operations

Posted by pi song <pi...@gmail.com>.
Of course, this is an open source project. Everybody can see the source
code.

Have a look at
http://svn.apache.org/viewvc/incubator/pig/trunk/src/org/apache/pig/impl/logicalLayer/parser/QueryParser.jjt?view=log

And here is the root source location:
http://svn.apache.org/viewvc/incubator/pig/trunk

It's a JavaCC file so a bit difficult to read. No good debugging tool
either.

Pi


On Thu, May 22, 2008 at 12:53 AM, Cosmin Lehene <cl...@adobe.com> wrote:

> Yes, well, I tried both. Didn't see the grammar definition, though. Where
> can I find it, so I could look in the code as well?
>
> Thanks,
> Cosmin
>
>

RE: Using PARALLEL clause for nested operations

Posted by Olga Natkovich <ol...@yahoo-inc.com>.
Pig combines multiple operators into a single map-reduce job. For your
example, there will be only one M-R job, with the map performing the load
and the reduce performing the rest of the computation. So putting PARALLEL
on the GROUP would achieve exactly what you need.

You can also run the EXPLAIN command on the alias you are about to store to
see how it would be executed. In your example you can say:

EXPLAIN R;
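
A sketch of that suggestion applied to the script quoted below (an
illustration, not tested output; the LOAD path is elided in the original and
stays elided, and 100 is simply the value Cosmin mentions):

```pig
-- PARALLEL goes on the GROUP, the operator that starts the reduce stage.
W = LOAD '...' AS (url, outlink);
G = GROUP W BY url PARALLEL 100;

-- The nested FOREACH is unchanged; it runs in the same reduce stage,
-- so it is spread over the reducers requested on the GROUP.
R = FOREACH G {
        FW = FILTER W BY outlink eq 'www.apache.org';
        PW = FW.outlink;
        DW = DISTINCT PW;
        GENERATE group, COUNT(DW);
}

-- Show how R would be executed before storing it.
EXPLAIN R;
```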

Olga 

> -----Original Message-----
> From: Cosmin Lehene [mailto:clehene@adobe.com] 
> Sent: Wednesday, May 21, 2008 12:42 PM
> To: pig-user@incubator.apache.org
> Subject: Re: Using PARALLEL clause for nested operations
> 
> Let's take this example:
> 
> W = LOAD '...' AS (url, outlink);
> G = GROUP W by url;
> R = FOREACH G {
>         FW = FILTER W BY outlink eq 'www.apache.org';
>         PW = FW.outlink;
>         DW = DISTINCT PW;
>         GENERATE group, COUNT(DW);
> }
> 
> 
> I'd be able to do G = GROUP W by url PARALLEL 100;
> 
> However the FOREACH could take advantage of reduce 
> parallelism. In my case it's the last FOREACH statement of a 
> pig script that takes one hour - compared with the rest of 
> the processing that is parallelized and takes only a couple 
> of minutes.
> 
> Cosmin
> 
> 
> 

Re: Using PARALLEL clause for nested operations

Posted by Cosmin Lehene <cl...@adobe.com>.
Let's take this example:

W = LOAD '...' AS (url, outlink);
G = GROUP W by url;
R = FOREACH G {
        FW = FILTER W BY outlink eq 'www.apache.org';
        PW = FW.outlink;
        DW = DISTINCT PW;
        GENERATE group, COUNT(DW);
}


I'd be able to do G = GROUP W by url PARALLEL 100;

However the FOREACH could take advantage of reduce parallelism. In my case
it's the last FOREACH statement of a pig script that takes one hour -
compared with the rest of the processing that is parallelized and takes only
a couple of minutes.

Cosmin



On 5/21/08 10:02 PM, "Olga Natkovich" <ol...@yahoo-inc.com> wrote:

> This is not a limitation. It is sufficient to put PARALLEL on the first
> operator in the reduce stage, which can only be one of the operators
> listed below. In your example, GROUP precedes FOREACH, and it is that
> GROUP that should have PARALLEL attached to it.
> 
> Olga 
> 


RE: Using PARALLEL clause for nested operations

Posted by Olga Natkovich <ol...@yahoo-inc.com>.
This is not a limitation. It is sufficient to put PARALLEL on the first
operator in the reduce stage, which can only be one of the operators
listed below. In your example, GROUP precedes FOREACH, and it is that
GROUP that should have PARALLEL attached to it.

Olga 

> -----Original Message-----
> From: Cosmin Lehene [mailto:clehene@adobe.com] 
> Sent: Wednesday, May 21, 2008 9:03 AM
> To: pig-user@incubator.apache.org
> Subject: Re: Using PARALLEL clause for nested operations
> 
> Thanks Olga, 
> 
> However, is that a limitation? If not, I'd be interested to 
> know the rational explanation behind this.
>  
> I mean, if you FOREACH...GENERATE group AS item, count(x) for 
> something, it seems to be the case for an "item" based 
> reducer, shouldn't it?
> 
> Cosmin
> 
> 

Re: Using PARALLEL clause for nested operations

Posted by Cosmin Lehene <cl...@adobe.com>.
Thanks Olga, 

However, is that a limitation? If not, I'd be interested to know the
rationale behind this.

I mean, if you FOREACH...GENERATE group AS item, count(x) for something, it
seems to be a case for an "item"-based reducer, doesn't it?

Cosmin


On 5/21/08 6:49 PM, "Olga Natkovich" <ol...@yahoo-inc.com> wrote:

> Having parallel on foreach has no effect. Parallel controls the number
> of reducers and should be placed on one of the following:
> 
> GROUP
> COGROUP
> SORT
> 
> Olga 
> 


RE: Using PARALLEL clause for nested operations

Posted by Olga Natkovich <ol...@yahoo-inc.com>.
Putting PARALLEL on a FOREACH has no effect. PARALLEL controls the number
of reducers and should be placed on one of the following:

GROUP
COGROUP
SORT
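
For illustration, the clause attaches to each of these operators roughly as
sketched below. The relation names, field names, input paths, and the value
10 are all hypothetical, and note that the sort operator is spelled ORDER in
Pig Latin:

```pig
-- Hypothetical inputs: two relations sharing a key field k.
A = LOAD 'a.txt' AS (k, v);
B = LOAD 'b.txt' AS (k, w);

G = GROUP A BY k PARALLEL 10;            -- 10 reducers for the grouping
C = COGROUP A BY k, B BY k PARALLEL 10;  -- 10 reducers for the cogroup
S = ORDER A BY k PARALLEL 10;            -- 10 reducers for the sort
```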

Olga 

> -----Original Message-----
> From: pi song [mailto:pi.songs@gmail.com] 
> Sent: Wednesday, May 21, 2008 7:50 AM
> To: pig-user@incubator.apache.org
> Subject: Re: Using PARALLEL clause for nested operations
> 
> According to the grammar file, it should be:-
> 
> FOREACH X {
>    ...
>    ... GENERATE ... ;
> } PARALLEL N;
> 
> But it doesn't work. I guess this is a bug!!
> 
> Pi
> 

Re: Using PARALLEL clause for nested operations

Posted by Cosmin Lehene <cl...@adobe.com>.
Yes, well, I tried both. Didn't see the grammar definition, though. Where
can I find it, so I could look in the code as well?

Thanks,
Cosmin


On 5/21/08 5:50 PM, "pi song" <pi...@gmail.com> wrote:

> According to the grammar file, it should be:-
> 
> FOREACH X {
>    ...
>    ... GENERATE ... ;
> } PARALLEL N;
> 
> But it doesn't work. I guess this is a bug!!
> 
> Pi
> 


Re: Using PARALLEL clause for nested operations

Posted by pi song <pi...@gmail.com>.
According to the grammar file, it should be:

FOREACH X {
   ...
   ... GENERATE ... ;
} PARALLEL N;

But it doesn't work. I guess this is a bug!!

Pi

On Thu, May 22, 2008 at 12:39 AM, Cosmin Lehene <cl...@adobe.com> wrote:

> Hi,
>
> I'm trying to use the PARALLEL clause inside a
>
> FOREACH X {
>    ...
>    ... GENERATE ... PARALLEL N;
> }
>
> And it fails.
>
> 670  [main] ERROR org.apache.pig.tools.grunt.Grunt  - Encountered
> "PARALLEL"
> at line 3, column 64.
> Was expecting one of:
>    ";" ...
>    "," ...
>    ":" ...
>
> Is it limited to regular statements?
>
> Thanks,
> Cosmin
>
>