Posted to user@pig.apache.org by Benjamin Juhn <be...@gmail.com> on 2012/06/30 01:19:18 UTC
Group by Fetching top 100 from each group
Hi there,
I'm trying to write a GROUP BY statement that returns only the top 100 records from each group. Does Pig support this?
Thanks,
Ben
Re: Group by Fetching top 100 from each group
Posted by Corbin Hoenes <co...@tynt.com>.
http://pig.apache.org/docs/r0.10.0/func.html#topx
Re: Group by Fetching top 100 from each group
Posted by Kris Coward <kr...@melon.org>.
Yes, that is indeed better.
On Fri, Jun 29, 2012 at 06:39:58PM -0700, Jonathan Coveney wrote:
> Ideally, you should use the TOP function. It will be more efficient, as it
> is algebraic.
--
Kris Coward http://unripe.melon.org/
GPG Fingerprint: 2BF3 957D 310A FEEC 4733 830E 21A4 05C7 1FEB 12B3
Re: Group by Fetching top 100 from each group
Posted by Jonathan Coveney <jc...@gmail.com>.
Ideally, you should use the TOP function. It will be more efficient, as it
is algebraic.
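For anyone wanting to see what that looks like, here is a minimal sketch of the TOP approach. The file name, the relation names, and the (key, score) schema are assumptions made up for illustration; only the TOP(n, column, bag) call itself comes from the built-in function docs. "Algebraic" here means Pig can compute partial results in the combiner before the reduce step, rather than sorting each whole group.

```pig
-- Hypothetical input: tab-separated (key, score) records.
-- The file name and field names are placeholders, not from the thread.
A = LOAD 'input.tsv' USING PigStorage('\t') AS (key:chararray, score:double);

B = GROUP A BY key;

-- TOP(n, column, bag) keeps the n tuples with the largest values in the
-- given column. The column is a 0-based index into the tuples of the bag,
-- so 1 refers to score here.
C = FOREACH B GENERATE group, TOP(100, 1, A);
```

As far as I know, the bag that TOP returns is not promised to be sorted, so treat this as "the 100 largest per group" rather than "the 100 largest, in order".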
Re: Group by Fetching top 100 from each group
Posted by Kris Coward <kr...@melon.org>.
LIMIT and ORDER BY are both allowed nested ops for a FOREACH statement.
These should be able to do what you want.
e.g.
B = GROUP A BY key;
C = FOREACH B {
    -- append DESC to the ORDER to take the largest values first
    X = ORDER A BY orderingParam;
    Y = LIMIT X 100;
    GENERATE group, Y;
}
-Kris
--
Kris Coward http://unripe.melon.org/
GPG Fingerprint: 2BF3 957D 310A FEEC 4733 830E 21A4 05C7 1FEB 12B3
Re: Group by Fetching top 100 from each group
Posted by Sal Uryasev <su...@linkedin.com>.
Hey Ben,
You can do a nested ORDER => LIMIT inside a FOREACH:
http://pig.apache.org/docs/r0.10.0/basic.html#foreach
Newer versions of Pig also have a TOP function that can replace the ORDER => LIMIT.
-Sal
RE: Group by Fetching top 100 from each group
Posted by Austin Stickney <as...@whitepages.com>.
You would want to do a FOREACH after the GROUP BY where you limit the contents of each group. Usually you would also want to order the bag before you limit it, so that you are taking the top 100 of something, rather than just a random selection of 100. Here's an example that creates a list of the top 100 salesmen from each state.
people = LOAD 'people.tsv' USING PigStorage() AS (
    fname:chararray,
    lname:chararray,
    state:chararray,
    sales:double
);
group_by_state = GROUP people BY state;
top_sales_by_state = FOREACH group_by_state {
    -- sort each state's bag so LIMIT keeps the highest sales
    order_by_sales = ORDER people BY sales DESC;
    top_sales = LIMIT order_by_sales 100;
    -- each tuple already carries state, so the group key is not repeated
    -- (a second field aliased state would clash with this one)
    GENERATE
        FLATTEN(top_sales) AS (
            fname:chararray,
            lname:chararray,
            state:chararray,
            sales:double
        )
    ;
};
Austin