Posted to user@pig.apache.org by paradisehit <pa...@163.com> on 2008/10/24 11:50:49 UTC

Why do we have 3 layers of plans?

 There are three plans in Pig: the LogicalPlan, the PhysicalPlan, and the MROperPlan. Here is my understanding of these plans.

1. LogicalPlan: implemented by Pig.
    It is a logical graph of the query plan.
    Each time Grunt receives a query, it generates a LogicalPlan for the current LogicalOperator. When the current query is a STORE query, it first creates a LogicalPlan for the LOStore LogicalOperator, and then a clone of that plan is generated by walking it with a DependencyOrderWalker.

2. PhysicalPlan: produced by the HExecutionEngine.
    The engine translates the cloned LogicalPlan into a PhysicalPlan, which is much closer to the data: each LogicalOperator (a node in the LogicalPlan graph) is translated to a PhysicalOperator.

3. MROperPlan: produced by the mapReduceLayer.
    It is a MapReduce plan that can be translated into Hadoop MapReduce jobs. The PhysicalPlan is compiled into the MapReduce plan by the MRCompiler.

The LogicalPlan has no dependency on the underlying platform (here, Hadoop), while the PhysicalPlan and MROperPlan are implemented in terms of Hadoop.
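To make the layering concrete, here is a toy sketch in Python of one query at each layer (the operator names only mimic Pig's; this is an illustration, not Pig code):

```python
# Toy model (hypothetical names mimicking Pig's operators; real plans are
# DAGs of Java objects) of one query at each of the three layers:
#   load -> filter -> group -> foreach -> filter -> store

logical_plan = ["LOLoad", "LOFilter", "LOCogroup",
                "LOForEach", "LOFilter", "LOStore"]

# Each logical operator becomes one or more physical operators;
# Group alone expands into three.
physical_plan = ["POLoad", "POFilter",
                 "POLocalRearrange", "POGlobalRearrange", "POPackage",
                 "POForEach", "POFilter", "POStore"]

# The MR layer adds no new operators: it parcels the physical ones out to
# the phases of a Hadoop job.  POGlobalRearrange drops out because
# Hadoop's own shuffle/sort implements it.
mr_oper_plan = [{
    "map":    ["POLoad", "POFilter", "POLocalRearrange"],
    "reduce": ["POPackage", "POForEach", "POFilter", "POStore"],
}]
```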

My doubt is: why do we have 3 layers of plans? Can't we use just one plan instead of the last two?
I would also like to know what happens when the PhysicalPlan is translated into the MROperPlan.

Re: Re: Why do we have 3 layers of plans?

Posted by paradisehit <pa...@163.com>.
 
At 2008-10-25, "Alan Gates" <ga...@yahoo-inc.com> wrote:
>You can think of the logical plan as a syntax tree.  In standard  
>language compilation the first step is to construct a syntax tree,  
>then you can do semantic checks (such as type checking, etc.), then  
>optimization, and then translation to machine code.  That's more or  
>less what we do, except translation to machine code for us is  
>translation to a physical plan.
>
>Keeping the logical plan separate from hadoop is intentional, as pig  
>is designed to run on multiple backends.
>
>So having two plans isn't odd or unique, but having three plans might  
>be.  The physical plan contains the operators pig will run.  The map  
>reduce plan describes how these operators will be run on hadoop.  So  
>most of the physical plan is broken up and placed into the operators  
>in the map reduce plan.  For example, consider the following pig script:
>
>A = load 'myfile';
>B = filter A by $0 > 0;
>C = group B by $0;
>D = foreach C generate group, COUNT(B);
>E = filter D by $1 > 1000;
>store E into 'output';

The logical plan looks much like the physical plan:

Load -> Filter -> Group -> Foreach -> Filter -> Store

except that Group is translated into 3 physical operators: Local Rearrange, Global Rearrange, and Package.
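What those three operators do can be simulated in a few lines of Python (a toy model, not Pig's implementation):

```python
# Toy simulation of the three physical operators that replace Group.

def local_rearrange(tuples, key_index=0):
    # Tag each tuple with its group key; runs map-side.
    return [(t[key_index], t) for t in tuples]

def global_rearrange(keyed):
    # Bring equal keys together; on Hadoop this is the shuffle/sort.
    return sorted(keyed, key=lambda kv: kv[0])

def package(sorted_keyed):
    # Reassemble one (key, bag-of-tuples) record per key; runs reduce-side.
    out = {}
    for key, t in sorted_keyed:
        out.setdefault(key, []).append(t)
    return out

data = [(1, "a"), (2, "b"), (1, "c")]
groups = package(global_rearrange(local_rearrange(data)))
# groups == {1: [(1, "a"), (1, "c")], 2: [(2, "b")]}
```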

>
>This will result in a physical plan of:
>
>Load -> Filter -> Group -> Foreach -> Filter -> Store
>
>which will get broken up into a map reduce plan
>
>Input:  myfile
>Map: Filter $0 > 0
>Sort key: B.$0
>Combiner: Foreach
>Reduce: Foreach -> Filter $1 > 1000;
>output:  output
>
>We could go directly from the logical plan to a map reduce plan with  
>physical operators in it.  We found it easier to think about this as  

What is easier? The optimization? I think both the PhysicalPlan and the MapReduce plan are OperatorPlans, so why not combine these two plans into one? The data structures are the same, and the LogicalPlan also has enough information to generate the MapReduce plans.

>a two step process where we construct a physical plan and then break  
>up that plan and place it in various map reduce jobs.  You could  
>think of this as a logical execution plan and a physical execution plan.
>
>Also, since pig can have multiple back ends, some common operations  
>such as filter will be the same on more than one backend, and there  
>will be a chance for code reuse.  For example, the local backend used  
>by the illustrate command reuses many of the physical operators from  
>the hadoop backend even though it does not use map reduce.
>
>Alan.
>

Re: Why do we have 3 layers of plans?

Posted by Alan Gates <ga...@yahoo-inc.com>.
You can think of the logical plan as a syntax tree.  In standard  
language compilation the first step is to construct a syntax tree,  
then you can do semantic checks (such as type checking, etc.), then  
optimization, and then translation to machine code.  That's more or  
less what we do, except translation to machine code for us is  
translation to a physical plan.

Keeping the logical plan separate from hadoop is intentional, as pig  
is designed to run on multiple backends.

So having two plans isn't odd or unique, but having three plans might  
be.  The physical plan contains the operators pig will run.  The map  
reduce plan describes how these operators will be run on hadoop.  So  
most of the physical plan is broken up and placed into the operators  
in the map reduce plan.  For example, consider the following pig script:

A = load 'myfile';
B = filter A by $0 > 0;
C = group B by $0;
D = foreach C generate group, COUNT(B);
E = filter D by $1 > 1000;
store E into 'output';

This will result in a physical plan of:

Load -> Filter -> Group -> Foreach -> Filter -> Store

which will get broken up into a map reduce plan

Input:  myfile
Map: Filter $0 > 0
Sort key: B.$0
Combiner: Foreach
Reduce: Foreach -> Filter $1 > 1000;
output:  output
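The cut the compiler makes can be sketched as follows (a toy in Python, assuming a single linear plan with one shuffle boundary; the real MRCompiler also handles the combiner, joins, splits, and chains of jobs):

```python
# Toy sketch: split a linear physical plan into an MR operator at the
# shuffle boundary (hypothetical; not the real MRCompiler).

def compile_to_mr(physical_plan):
    cut = physical_plan.index("POGlobalRearrange")
    return {
        "map":    physical_plan[:cut],       # ends with POLocalRearrange
        "reduce": physical_plan[cut + 1:],   # starts with POPackage
    }

pplan = ["POLoad", "POFilter", "POLocalRearrange",
         "POGlobalRearrange", "POPackage", "POForEach", "POFilter", "POStore"]
mr_job = compile_to_mr(pplan)
# mr_job["map"]    == ["POLoad", "POFilter", "POLocalRearrange"]
# mr_job["reduce"] == ["POPackage", "POForEach", "POFilter", "POStore"]
```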

We could go directly from the logical plan to a map reduce plan with  
physical operators in it.  We found it easier to think about this as  
a two step process where we construct a physical plan and then break  
up that plan and place it in various map reduce jobs.  You could  
think of this as a logical execution plan and a physical execution plan.

Also, since pig can have multiple back ends, some common operations  
such as filter will be the same on more than one backend, and there  
will be a chance for code reuse.  For example, the local backend used  
by the illustrate command reuses many of the physical operators from  
the hadoop backend even though it does not use map reduce.
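The reuse point can be sketched like this (a hypothetical interface; Pig's real PhysicalOperator API is richer):

```python
# One physical operator implementation, reusable by any backend that can
# push tuples through it: Hadoop's map/reduce tasks, or a local runner
# such as the one behind illustrate.

class POFilter:
    def __init__(self, predicate):
        self.predicate = predicate

    def process(self, tuples):
        # Keep only the tuples for which the predicate holds.
        return [t for t in tuples if self.predicate(t)]

# A local backend can drive the operator with no MapReduce machinery:
local_result = POFilter(lambda t: t[0] > 0).process([(1,), (-2,), (3,)])
# local_result == [(1,), (3,)]
```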

Alan.
