Posted to user@pig.apache.org by Iman Elghandour <ie...@yahoo.com> on 2009/03/23 03:40:43 UTC

implicit splits in multiquery plans

Hello,
I was examining the plan for the Pig script that is available in the JIRA issue https://issues.apache.org/jira/browse/PIG-627 and noticed that the implicit split appears to be added in the wrong place:

A = load 'data' as (a, b, c);
B = filter A by a > 5;
store B into 'output1';
C = group B by b;
store C into 'output2';

The logical plan is below. I think the split operator
should be placed before the filter, so that the filter is
performed on only one branch rather than on both.

Store 1-14 Schema: {a: bytearray,b: bytearray,c: bytearray} Type: Unknown
|
|---SplitOutput[B] 1-21 Schema: {a: bytearray,b: bytearray,c: bytearray} Type: bag
    |   |
    |   Const 1-20 FieldSchema: boolean Type: boolean
    |
    |---Split 1-19 Schema: {a: bytearray,b: bytearray,c: bytearray} Type: bag
        |
        |---Filter 1-13 Schema: {a: bytearray,b: bytearray,c: bytearray} Type: bag
            |   |
            |   GreaterThan 1-12 FieldSchema: boolean Type: boolean
            |   |
            |   |---Const 1-11 FieldSchema: int Type: int
            |   |
            |   |---Cast 1-18 FieldSchema: int Type: int
            |       |
            |       |---Project 1-10 Projections: [0] Overloaded: false FieldSchema: a: bytearray Type: bytearray
            |           Input: Load 1-9
            |
            |---Load 1-9 Schema: {a: bytearray,b: bytearray,c: bytearray} Type: bag

Store 1-17 Schema: {group: bytearray,B: {a: bytearray,b: bytearray,c: bytearray}} Type: Unknown
|
|---CoGroup 1-16 Schema: {group: bytearray,B: {a: bytearray,b: bytearray,c: bytearray}} Type: bag
    |   |
    |   Project 1-15 Projections: [1] Overloaded: false FieldSchema: b: bytearray Type: bytearray
    |   Input: SplitOutput[B] 1-23
    |
    |---SplitOutput[B] 1-23 Schema: {a: bytearray,b: bytearray,c: bytearray} Type: bag
        |   |
        |   Const 1-22 FieldSchema: boolean Type: boolean
        |
        |---Split 1-19 Schema: {a: bytearray,b: bytearray,c: bytearray} Type: bag
            |
            |---Filter 1-13 Schema: {a: bytearray,b: bytearray,c: bytearray} Type: bag
                |   |
                |   GreaterThan 1-12 FieldSchema: boolean Type: boolean
                |   |
                |   |---Const 1-11 FieldSchema: int Type: int
                |   |
                |   |---Cast 1-18 FieldSchema: int Type: int
                |       |
                |       |---Project 1-10 Projections: [0] Overloaded: false FieldSchema: a: bytearray Type: bytearray
                |           Input: Load 1-9
                |
                |---Load 1-9 Schema: {a: bytearray,b: bytearray,c: bytearray} Type: bag

Thanks,
Iman.





Re: implicit splits in multiquery plans

Posted by Gunther Hagleitner <ha...@yahoo-inc.com>.
Hi,

I believe the split is in the right place. Both B and C need to have the
filter performed before they are stored. Also, the filter is only going to
be run once: load (1-9), filter (1-13) and split (1-19) are the same
operators in both paths, and the text dump simply prints them again under
each store. I've attached a graphical representation of the same
logical plan (which I think is easier to read).
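
To make the sharing concrete, here is a minimal Python sketch of the plan as a
graph. It is not Pig's planner code: the Op class and the ancestors helper are
made up for illustration, and only the operator names and keys are taken from
the plan above. It shows that both stores reach the same Load, Filter and
Split nodes.

```python
class Op:
    """Hypothetical stand-in for a logical operator (not a Pig class)."""

    def __init__(self, name, key, *inputs):
        self.name = name
        self.key = key              # operator key as printed in the plan
        self.inputs = list(inputs)  # upstream operators


def ancestors(op):
    """Collect the keys of op and of every operator upstream of it."""
    seen = set()
    stack = [op]
    while stack:
        cur = stack.pop()
        if cur.key not in seen:
            seen.add(cur.key)
            stack.extend(cur.inputs)
    return seen


# One object per operator key; the two branches only diverge after the split.
load = Op("Load", "1-9")
filt = Op("Filter", "1-13", load)
split = Op("Split", "1-19", filt)
out_store = Op("SplitOutput[B]", "1-21", split)
out_group = Op("SplitOutput[B]", "1-23", split)
store1 = Op("Store", "1-14", out_store)
cogroup = Op("CoGroup", "1-16", out_group)
store2 = Op("Store", "1-17", cogroup)

# Operators that appear on both paths.
shared = ancestors(store1) & ancestors(store2)
print(sorted(shared))
```

The shared set contains only the load, filter and split keys, which is why
the filter appears twice in the printout but exists once in the plan.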

If you wanted the filter to be performed only on the non-cogroup path, for
instance, the script would have to read:

A = load 'data' as (a, b, c);
B = filter A by a > 5;
store B into 'output1';
C = group A by b; -- Use pre-filter handle A, instead of B
store C into 'output2';
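
As a rough illustration of how the two scripts differ, the following Python
sketch mimics them on a few made-up rows. The data, the group_by_b helper and
the call counter are assumptions for the example and say nothing about how Pig
executes the plan; the point is only that the filter is evaluated once per
input row and that grouping A instead of B changes what lands in 'output2'.

```python
from collections import defaultdict

# Made-up (a, b, c) tuples standing in for 'data'.
rows = [(3, "x", 1), (7, "x", 2), (9, "y", 3)]

stats = {"filter_calls": 0}


def a_gt_5(t):
    """Filter predicate 'a > 5', counting how often it is evaluated."""
    stats["filter_calls"] += 1
    return t[0] > 5


def group_by_b(tuples):
    """Group tuples on field b (position 1)."""
    groups = defaultdict(list)
    for t in tuples:
        groups[t[1]].append(t)
    return dict(groups)


# Original script: B is computed once and feeds both the store and the group.
B = [t for t in rows if a_gt_5(t)]
output1 = B
output2_from_B = group_by_b(B)

# Modified script: the group reads the unfiltered relation A.
output2_from_A = group_by_b(rows)

print(stats["filter_calls"], output2_from_B, output2_from_A)
```

With the modified script the row with a = 3 shows up in the grouped output,
which is not what the original script asks for.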

Thanks,
Gunther.
