Posted to dev@pig.apache.org by Santhosh Srinivasan <sm...@yahoo-inc.com> on 2008/05/03 02:53:40 UTC

Implicit Split

Pig currently allows implicit splits within the foreach block. An
example that illustrates this behaviour follows:

    A = load 'input1';
    B = load 'input2';
    C = cogroup A by $0, B by $0;
    D = foreach C do {
        XX = filter A by $0 > 5;
        // at this point, there is an implicit split in the foreach plan
        XY = filter B by $0 > 5;
        // here the generate needs to handle the merge, as its inputs
        // are from XX and XY
        generate XX.$1, XY.$1;
    }

Notice that there is an implicit split in the foreach plan. Each input
tuple from C has to be piped to both XX and XY. The generate now has to
handle the merge, as both XX and XY serve as its inputs. The inputs to
generate are now a DAG and not a tree.

      Generate
       /    \
     XX      XY
       \    /
      Foreach
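
To make the tree-versus-DAG distinction concrete, here is a minimal
sketch in Python that flags an implicit split in a plan. It is only an
illustration: the plan is modelled as a plain list of
(producer, consumer) edges, and the function name and operator labels
are hypothetical, not Pig's internal classes. The assumption is that an
operator with more than one consumer marks an implicit split.

```python
from collections import defaultdict


def find_implicit_splits(edges):
    """Return the operators whose output feeds more than one consumer.

    `edges` is a list of (producer, consumer) pairs describing data flow.
    Hypothetical helper: an empty result means the plan has no implicit
    split; any operator listed has its output fanned out to several
    consumers, so the plan is a DAG rather than a simple chain.
    """
    consumers = defaultdict(set)
    for producer, consumer in edges:
        consumers[producer].add(consumer)
    return sorted(op for op, outs in consumers.items() if len(outs) > 1)


# Data flow of the nested plan above: the foreach input is piped to both
# filters, and both filters feed the generate.
nested_plan = [
    ("Foreach", "XX"),
    ("Foreach", "XY"),
    ("XX", "Generate"),
    ("XY", "Generate"),
]

# A plan with a single filter, for comparison.
linear_plan = [
    ("Foreach", "XX"),
    ("XX", "Generate"),
]

print(find_implicit_splits(nested_plan))  # ['Foreach']
print(find_implicit_splits(linear_plan))  # []
```

A check along these lines could be used either to reject such plans or
to decide where an explicit split operator would have to be inserted.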

This makes the execution pipeline fairly complex. Should we restrict the
usage to not allow DAGs as input to the generate?


Thoughts?

Thanks,
Santhosh

Re: Implicit Split

Posted by pi song <pi...@gmail.com>.
Conceptually, if we could generalize our nested data processing model to a
recursive definition (partly done through the introduction of inner plans),
this problematic inner plan could be constructed easily by applying the
same plan compilation logic at every level. This sounds cool, right? I
really want to see how far we can go (to the point where theory meets the
practical world).

Back to your question: I want to see the new execution engine working as
soon as possible, so I agree with you that we don't have to support this for
the time being (this use case is not very common). I think it shouldn't be
too difficult to add this functionality later based on our current inner
plan design.

BTW, let's see what other people think.

Pi


Re: Implicit Split

Posted by pi song <pi...@gmail.com>.
Mridul,

By design, we are heading to the point where all (or nearly all) operators
are supported in nested queries plus they will not be limited to only 1
nested level.

Pi

On Mon, May 5, 2008 at 4:48 PM, Mridul Muralidharan <mr...@yahoo-inc.com>
wrote:

>
> This is something which is used quite heavily (at least by our team).
> I was hoping this would be expanded, for example by adding support for
> nested statements in FILTER as well (like in FOREACH). Currently we have
> to hack around this using FOREACH & flag statements to get that
> functionality, since FILTER does not support it.
>
> Regards,
> Mridul
>
>

Re: Implicit Split

Posted by Mridul Muralidharan <mr...@yahoo-inc.com>.
This is something which is used quite heavily (at least by our team).
I was hoping this would be expanded, for example by adding support for
nested statements in FILTER as well (like in FOREACH). Currently we have
to hack around this using FOREACH & flag statements to get that
functionality, since FILTER does not support it.

Regards,
Mridul
