Posted to dev@pig.apache.org by Alan Gates <ga...@yahoo-inc.com> on 2008/06/30 20:20:58 UTC

Re: The plan generated for this nested plan is not as per we had discussed

Analysis below.

Shravan M Narayanamurthy wrote:
> Hi Guys,
> I think we need to find a proper set of rules for the project's 
> schema. The following script kind of covers all the scenarios:
> A = load 'a';
> B = group A by $0;
> C = foreach B {
> C1 = filter A by $0>5;
> C2 = distinct C1;
> C3 = distinct A;
> generate group, udf1(*), udf2(C2), udf3(C2.$1), udf4(C3), udf(C3.$1);
> }
>
> I think we had not thought about the projection in the inner plan of 
> filter. With this constraint, we need a new set of rules. Can you post 
> an algorithm that will work to set the return types of the projects?
>
> Thanks & Regards,
> --Shravan
>
> <snip>
In this case, the foreach should have the following plans:

0 - proj(0)

1 - proj( * ) -> udf1

2 - proj(1) -> filter -> distinct -> proj( * ) -> udf2

3 - proj(1) -> filter -> distinct -> proj(1) -> udf3

4 - proj(1) -> distinct -> proj( * ) -> udf4

5 - proj(1) -> distinct -> proj(1) -> udf5

In plans 2 and 3, filter will have an inner plan of:

proj(0) -> gt, const(5) -> gt
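
To make the mapping from the script to plans 2 through 5 concrete, here is a rough Python sketch that expands each nested alias back to the bag A and appends the final project and UDF. The names ALIASES, expand, and plan are made up for illustration and are not part of Pig; the sketch also assumes A sits at position 1 of the grouped tuple, as the plans above imply.

```python
# Nested aliases from the script, each recorded as (operator, source alias).
ALIASES = {
    "C1": ("filter", "A"),
    "C2": ("distinct", "C1"),
    "C3": ("distinct", "A"),
}


def expand(alias):
    """Return the operator chain that computes `alias`, starting from bag A."""
    if alias == "A":
        # Assumption: A is field 1 of the grouped tuple (group is field 0).
        return ["proj(1)"]
    op, source = ALIASES[alias]
    return expand(source) + [op]


def plan(alias, column, udf):
    """Render one generate expression, e.g. udf3(C2.$1), as a plan string."""
    return " -> ".join(expand(alias) + [f"proj({column})", udf])


if __name__ == "__main__":
    for alias, column, udf in [("C2", "*", "udf2"), ("C2", "1", "udf3"),
                               ("C3", "*", "udf4"), ("C3", "1", "udf5")]:
        print(plan(alias, column, udf))
```

Each generate expression gets its own full chain here, which matches the listing above where plans 2 and 3 both carry their own filter and distinct.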

In discussing the scenario, Santhosh and I saw one issue: in plan 1, the 
proj( * ) will incorrectly try to accumulate a bag for udf1 when it should 
just pass the tuple.  Santhosh is going to fix that by changing the project 
to check whether it has a predecessor, and if so whether that predecessor 
is a relational operator, instead of looking at its input to see if it's a 
relational operator.
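
The predecessor check described above could look roughly like the following Python sketch. The Op class and project_accumulates_bag function are hypothetical and do not correspond to Pig's actual operator classes; the sketch assumes that a project should accumulate a bag only when it directly follows a relational operator in its own plan.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Op:
    """A plan operator with an optional predecessor in the same plan."""
    name: str
    relational: bool = False
    predecessor: Optional["Op"] = None


def project_accumulates_bag(project: Op) -> bool:
    """Assumed rule: accumulate a bag only after a relational predecessor."""
    pred = project.predecessor
    return pred is not None and pred.relational


# Plan 1: proj(*) is the first operator, so it has no predecessor.
plan1_proj = Op("proj(*)")

# Plan 2: proj(1) -> filter -> distinct -> proj(*)
root = Op("proj(1)")
flt = Op("filter", relational=True, predecessor=root)
dst = Op("distinct", relational=True, predecessor=flt)
plan2_proj = Op("proj(*)", predecessor=dst)

if __name__ == "__main__":
    print(project_accumulates_bag(plan1_proj))  # no predecessor: pass the tuple
    print(project_accumulates_bag(plan2_proj))  # follows distinct: build a bag
```

Under this rule the leading proj( * ) in plan 1 just passes the tuple through, while the proj( * ) after distinct in plan 2 still collects a bag for udf2.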

I didn't follow your comment on the issue with the project in the filter 
plan.  It looked fine to me.

Alan.

Re: The plan generated for this nested plan is not as per we had discussed

Posted by Alan Gates <ga...@yahoo-inc.com>.

Yes.  As previously discussed, inner plans are duplicated at this point, 
rather than having splits inserted.  One future optimization we need to 
add is putting these splits in place.  Determining when to put in the 
splits is easy.  But first we need to write an efficient split 
implementation to handle this.
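
As a rough illustration of the cost, the Python sketch below counts how often each operator runs per group with duplicated plans, and how often it would run if plans sharing a common prefix were fed from a single split. Treating a shared prefix as the place for a split is my assumption for the sake of the example, not a description of how the optimization will be implemented; PLANS, duplicated_runs, and shared_runs are made-up names.

```python
# The six foreach plans from the earlier message, one operator list each.
PLANS = [
    ["proj(0)"],
    ["proj(*)", "udf1"],
    ["proj(1)", "filter", "distinct", "proj(*)", "udf2"],
    ["proj(1)", "filter", "distinct", "proj(1)", "udf3"],
    ["proj(1)", "distinct", "proj(*)", "udf4"],
    ["proj(1)", "distinct", "proj(1)", "udf5"],
]


def duplicated_runs(plans, op):
    """Executions of `op` when every plan runs its own copy of the chain."""
    return sum(plan.count(op) for plan in plans)


def shared_runs(plans, op):
    """Executions of `op` if identical prefixes were computed only once."""
    prefixes = {
        tuple(plan[: i + 1])
        for plan in plans
        for i, name in enumerate(plan)
        if name == op
    }
    return len(prefixes)


if __name__ == "__main__":
    for op in ("filter", "distinct"):
        print(op, duplicated_runs(PLANS, op), shared_runs(PLANS, op))
```

For this script, filter runs twice and distinct four times per group with duplicated plans, against once and twice respectively if the common prefixes were shared.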

Alan.

Olga Natkovich wrote:
> Does this mean that distinct and filter will be recomputed several
> times?
>
> Olga 

RE: The plan generated for this nested plan is not as per we had discussed

Posted by Olga Natkovich <ol...@yahoo-inc.com>.

Does this mean that distinct and filter will be recomputed several
times?

Olga 
