Posted to dev@pig.apache.org by "Santhosh Srinivasan (JIRA)" <ji...@apache.org> on 2008/06/06 03:02:49 UTC

[jira] Issue Comment Edited: (PIG-158) Rework logical plan

    [ https://issues.apache.org/jira/browse/PIG-158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12602870#action_12602870 ] 

sms edited comment on PIG-158 at 6/5/08 6:02 PM:
-----------------------------------------------------------------

Eliminating the Generate Operator

It was recommended earlier (thanks, Pi) that we eliminate the Generate operator in the Foreach ... Generate context.

In the types branch, the Generate operator (on both the logical and physical sides) is a container for the projected expressions. It wraps each projected expression in its own nested plan. The resulting list of plans can mix expressions that take their input from Generate's predecessor with expressions that read directly from the Foreach input. The examples below illustrate both cases.

{code}

--Example 1

a = load 'input1';
b = group a by $0;
c = foreach b {
	d = distinct a;
	generate group, sum(d.$1);
}

{code}

Logical plan after parsing:

{noformat}

ForEach Test-Plan-Builder-655
|   |
|   Generate Test-Plan-Builder-654
|   |   |
|   |   Project Test-Plan-Builder-650
|   |   |
|   |   UserFunc Test-Plan-Builder-653
|   |   |
|   |   |---Project Test-Plan-Builder-652
|   |
|   |---Distinct Test-Plan-Builder-649
|       |
|       |---Project Test-Plan-Builder-648
|
|---CoGroup Test-Plan-Builder-647
    |   |
    |   Project Test-Plan-Builder-646
    |
    |---Load Test-Plan-Builder-645

{noformat}

The Generate operator has two nested plans: one for Project(group, b) and the other for the aggregate (sum). Two points are worth noting:

1. The projection of 'group' does not require the input 'd'.
2. The root of the second plan, Project(1, project(d, b)), requires the input 'd', which is connected to Generate but is not an input within the nested plan.

The former should be part of the Foreach operator; the latter is a problem on the physical side. When getNext is called on the root of the nested plan, it seeks its input from Generate, whereas the input it actually needs comes from Distinct (d).

Let us look at another example. Here, input 'd' is used twice in the generate, which is a case of an implicit split: the output of 'd' has to be fed to both the sum and the count.

{code}

--Example 2

a = load 'input1';
b = group a by $0;
c = foreach b {
	d = distinct a;
	generate sum(d.$1), count(d.$1);
}

{code}

To remove the Generate operator, the nested plans that are currently part of Generate will be promoted to the Foreach operator, with the following changes:

1. Any expression in the generate (the root of a nested plan) that does not require generate's input will be moved into a nested plan of Foreach.

2. The remaining expressions of generate will be attached as leaves of generate's input by duplicating the graph.
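
The two rules above can be sketched with a toy model. This is only an illustration, not Pig's actual plan classes: the function eliminate_generate, the Plan alias, and the operator labels are all hypothetical, and each plan is simplified to a linear list of labels in data-flow order (input first, leaf last).

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical toy model: a plan is a list of operator labels in
# data-flow order. Real Pig plans are operator graphs, not lists.
Plan = List[str]


def eliminate_generate(
    nested_aliases: Dict[str, Plan],
    generate_exprs: List[Tuple[Plan, Optional[str]]],
) -> List[Plan]:
    """Promote Generate's expressions to nested plans of Foreach.

    nested_aliases maps a nested alias (e.g. 'd') to the plan computing it.
    generate_exprs pairs each expression with the nested alias it reads
    from, or None if it reads directly from the Foreach input.
    """
    foreach_plans = []
    for expr_ops, alias in generate_exprs:
        if alias is None:
            # Rule 1: no dependence on generate's input, so the
            # expression becomes a nested plan of Foreach on its own.
            foreach_plans.append(list(expr_ops))
        else:
            # Rule 2: attach the expression as the leaf of a private
            # copy of the plan that produces its input.
            foreach_plans.append(list(nested_aliases[alias]) + list(expr_ops))
    return foreach_plans


# Example 1: generate group, sum(d.$1);
nested_aliases = {"d": ["Project(a)", "Distinct"]}
generate_exprs = [
    (["Project(group)"], None),
    (["Project($1)", "sum"], "d"),
]
example1_plans = eliminate_generate(nested_aliases, generate_exprs)
```

In this model, example 1 yields one plan holding only the projection of 'group' and a second plan that starts from the computation of 'd' and ends in sum.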

Going back to example 1, the logical plan for Foreach will have two nested plans. The first will contain Project(group, b). The second will have 'd' as the root and the aggregate function sum as the leaf.

Example 2 will translate to two nested plans, both of which have 'd' as the input. The leaves of the individual plans will be the aggregate functions sum and count, respectively.
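
The effect of duplicating the graph in example 2 can be sketched as follows. The Op class and attach_as_leaf helper are hypothetical stand-ins rather than Pig's operator classes; the point is only that each aggregate ends up with its own copy of the 'd' subgraph, so no operator is shared between the two nested plans.

```python
import copy
from dataclasses import dataclass, field
from typing import List


@dataclass
class Op:
    """Hypothetical toy operator: a label plus the operators feeding it."""
    name: str
    inputs: List["Op"] = field(default_factory=list)


def attach_as_leaf(expr_name: str, nested_input: Op) -> Op:
    """Build a plan whose leaf is expr_name, fed by a private copy
    of nested_input (the graph is duplicated, not shared)."""
    return Op(expr_name, [copy.deepcopy(nested_input)])


# d = distinct a;
d = Op("Distinct", [Op("Project(a)")])

# generate sum(d.$1), count(d.$1);
sum_plan = attach_as_leaf("sum(d.$1)", d)
count_plan = attach_as_leaf("count(d.$1)", d)

# Each plan has an equal but separate copy of the 'd' subgraph.
print(sum_plan.inputs[0] == count_plan.inputs[0])
print(sum_plan.inputs[0] is count_plan.inputs[0])
```

Because neither plan shares an operator with the other, the implicit split disappears in this model, at the cost of computing 'd' once per plan.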

> Rework logical plan
> -------------------
>
>                 Key: PIG-158
>                 URL: https://issues.apache.org/jira/browse/PIG-158
>             Project: Pig
>          Issue Type: Sub-task
>          Components: impl
>            Reporter: Alan Gates
>            Assignee: Alan Gates
>         Attachments: is_null.patch, logical_operators.patch, logical_operators_rev_1.patch, logical_operators_rev_2.patch, logical_operators_rev_3.patch, parser_changes.patch, parser_changes_v1.patch, parser_changes_v2.patch, parser_changes_v3.patch, parser_changes_v4.patch, ParserErrors.txt, udf_fix.patch, udf_funcSpec.patch, udf_return_type.patch, user_func_and_store.patch, visitorWalker.patch
>
>
> Rework the logical plan in line with http://wiki.apache.org/pig/PigExecutionModel
