Posted to user@pig.apache.org by Kannan Shah <sh...@gmail.com> on 2012/09/17 21:55:29 UTC
STREAM in foreach block
I'm trying to group tuples by a key, sort by another key within each group,
and then pass the sorted list of tuples for each group to a perl script. I
need to use the perl script because I need to compute an aggregate quantity
that is dependent on the sort order, and I'm not much of a Java programmer,
so I don't know how to write a user-defined aggregate function.
Doing this requires me to use STREAM in a foreach block, after the GROUP
statement. Basically something like:
r2 = group r1 by key1;
r3 = foreach r2 {
    s1 = r1;
    s2 = order s1 by key2;
    s3 = stream s2 through myperlscript as (x,y,z);
    generate group, flatten(s3.x), flatten(s3.y), flatten(s3.z);
}
store r3 into 'r3.out' using PigStorage(';');
NOTE: The FLATTENs are there only for syntactic reasons; myperlscript will
only output one tuple for each group.
I'm getting errors that make me think that you can't use the STREAM operator
within a foreach block, but I'm not sure. Can someone confirm? Is there a
workaround for this sort of situation?
Any help appreciated,
Kannan
Re: STREAM in foreach block
Posted by Kannan Shah <sh...@gmail.com>.
Thanks for the reference, Dan.
Is there a workaround for this sort of situation, aside from writing a UDAF
in Java? Basically I want to compute an aggregate quantity that depends on
sort order, for tuples within a BY group. Any ideas?
On 17 September 2012 20:27, Dan Young <da...@gmail.com> wrote:
> I believe these are the ops supported in a nested foreach:
>
> CROSS, DISTINCT, FILTER, FOREACH, LIMIT, and ORDER BY.
>
> See:
>
> http://pig.apache.org/docs/r0.10.0/basic.html#foreach
--
Kannan Shah
Analytical-Modeling Staff Scientist
Financial Services - Modeling
SAS Institute
San Diego
Detection-and-Estimation Group
Data Fusion Laboratory
Philadelphia
Re: STREAM in foreach block
Posted by Dan Young <da...@gmail.com>.
I believe these are the ops supported in a nested foreach:
CROSS, DISTINCT, FILTER, FOREACH, LIMIT, and ORDER BY.
See:
http://pig.apache.org/docs/r0.10.0/basic.html#foreach
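
Since ORDER BY is on that list but STREAM is not, one possible workaround (a sketch only, not something confirmed in this thread) is to keep the sort inside the nested foreach and hand the sorted bag to a Python (Jython) UDF in the GENERATE clause instead of streaming it through a script. Everything named below is hypothetical: the file name ordered_agg.py, the function sum_of_steps, the assumed row layout (key1, key2, value), and the aggregate itself, which just sums the absolute change in value between consecutive rows to stand in for any order-dependent quantity.

```python
# ordered_agg.py: hypothetical Jython UDF sketch, written to also run as plain Python.
#
# Assumed Pig side (untested, names are illustrative):
#   register 'ordered_agg.py' using jython as agg;
#   r2 = group r1 by key1;
#   r3 = foreach r2 {
#       s2 = order r1 by key2;
#       generate group, agg.sum_of_steps(s2);
#   };

try:
    outputSchema  # Pig supplies this decorator when the file is loaded as a Jython UDF
except NameError:
    # Fallback so the file can be imported and tested outside Pig.
    def outputSchema(schema):
        def decorate(func):
            func.outputSchema = schema
            return func
        return decorate


@outputSchema("total:double")
def sum_of_steps(bag):
    """Sum the absolute change in `value` between consecutive rows.

    `bag` is assumed to be an iterable of (key1, key2, value) tuples that
    is already sorted by key2, so the result depends on that order.
    """
    total = 0.0
    previous = None
    for row in bag:
        value = float(row[2])  # assumption: value is the third field
        if previous is not None:
            total += abs(value - previous)
        previous = value
    return total
```

Called on its own, the function takes any list of tuples, for example sum_of_steps([("g", 1, 10.0), ("g", 2, 13.0), ("g", 3, 11.0)]), which makes it easy to check the logic before wiring it into a Pig script.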