Posted to user@pig.apache.org by Kannan Shah <sh...@gmail.com> on 2012/09/17 21:55:29 UTC

STREAM in foreach block

I'm trying to group tuples by a key, sort by another key within each group,
and then pass the sorted list of tuples for each group to a perl script. I
need to use the perl script because I need to compute an aggregate quantity
that is dependent on the sort order, and I'm not much of a Java programmer,
so I don't know how to write a user-defined aggregate function.

Doing this requires me to use STREAM in a foreach block, after the GROUP
statement. Basically something like:

r2 = group r1 by key1 ;
r3 = foreach r2 {
   s1=r1;
   s2=order s1 by key2;
   s3=stream s2 through myperlscript as (x,y,z);
   generate group,flatten(s3.x),flatten(s3.y),flatten(s3.z);
}
store r3 into 'r3.out' using PigStorage(';');

NOTE: The FLATTENs are there only for syntactic reasons; myperlscript will
only output one tuple for each group.

I'm getting errors that make me think that you can't use the STREAM operator
within a foreach block, but I'm not sure. Can someone confirm? Is there a
workaround to this sort of situation?

Any help appreciated,
Kannan

Re: STREAM in foreach block

Posted by Kannan Shah <sh...@gmail.com>.
Thanks for the reference, Dan.

Is there a workaround for this sort of situation, aside from writing a UDAF
in Java? Basically I want to compute an aggregate quantity that depends on
sort order, for tuples within a BY group. Any ideas?

On 17 September 2012 20:27, Dan Young <da...@gmail.com> wrote:

> I believe these are the ops supported in a nested foreach:
>
> CROSS, DISTINCT, FILTER, FOREACH, LIMIT, and ORDER BY.
>
> See:
>
> http://pig.apache.org/docs/r0.10.0/basic.html#foreach



-- 
Kannan Shah

Analytical-Modeling Staff Scientist
Financial Services - Modeling
SAS Institute
San Diego

Detection-and-Estimation Group
Data Fusion Laboratory
Philadelphia

Re: STREAM in foreach block

Posted by Dan Young <da...@gmail.com>.
I believe these are the ops supported in a nested foreach:

CROSS, DISTINCT, FILTER, FOREACH, LIMIT, and ORDER BY.

See:

http://pig.apache.org/docs/r0.10.0/basic.html#foreach
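
Since ORDER BY is on that list but STREAM is not, one possible workaround
that avoids Java is to keep the ORDER inside the nested foreach and hand the
sorted bag to a Python (Jython) UDF instead of a streamed script. The sketch
below is only an illustration under several assumptions: the file name
sorted_agg.py, the function name sorted_aggregate, the (key2, value) tuple
layout, and the aggregate itself (first value, last value, largest jump
between consecutive values) are all hypothetical stand-ins for whatever
myperlscript computes. It also assumes that Pig hands the bag to the UDF as a
list of tuples in the order produced by the nested ORDER, and that Pig
supplies the outputSchema decorator when the file is registered.

```python
# sorted_agg.py (hypothetical file name)

# Assumption: when registered "using jython", Pig provides the outputSchema
# decorator. This fallback only exists so the file also runs outside Pig.
try:
    outputSchema
except NameError:
    def outputSchema(schema):
        def wrap(func):
            return func
        return wrap


@outputSchema("t:tuple(x:double, y:double, z:double)")
def sorted_aggregate(bag):
    """Order-dependent aggregate over one group's tuples.

    bag is assumed to be a list of (key2, value) tuples that is already
    sorted by key2. Returns (first value, last value, largest absolute
    difference between consecutive values), or None for an empty bag.
    """
    if not bag:
        return None
    values = [float(row[1]) for row in bag]
    max_step = 0.0
    # Walk consecutive pairs; this is the part that depends on sort order.
    for prev, cur in zip(values, values[1:]):
        max_step = max(max_step, abs(cur - prev))
    return (values[0], values[-1], max_step)
```

On the Pig side this would presumably be wired in with something like
register 'sorted_agg.py' using jython as myfuncs; followed, inside the
foreach block, by s2 = order r1 by key2; and
generate group, flatten(myfuncs.sorted_aggregate(s2)); so the per-group
logic lives in the UDF rather than in a STREAM step.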