Posted to user@pig.apache.org by Benjamin Juhn <be...@gmail.com> on 2012/06/30 01:19:18 UTC

Group by Fetching top 100 from each group

Hi there,

I'm trying to write a GROUP BY statement that returns only the top 100 records from each group. Does Pig support this?

Thanks,
Ben

Re: Group by Fetching top 100 from each group

Posted by Corbin Hoenes <co...@tynt.com>.
http://pig.apache.org/docs/r0.10.0/func.html#topx


On Jun 29, 2012, at 5:19 PM, Benjamin Juhn wrote:

> Hi there,
> 
> I'm trying to write a group by statement, only returning the top 100 records from each group.  Does pig support this?
> 
> Thanks,
> Ben


Re: Group by Fetching top 100 from each group

Posted by Kris Coward <kr...@melon.org>.
Yes, that is indeed better.

On Fri, Jun 29, 2012 at 06:39:58PM -0700, Jonathan Coveney wrote:
> Ideally, you should use the TOP function. It will be more efficient, as it
> is algebraic.
> 
> 2012/6/29 Kris Coward <kr...@melon.org>
> 
> >
> > LIMIT and ORDER BY are both allowed nested ops for a FOREACH statement.
> > These should be able to do what you want.
> >
> > e.g.
> >
> > B = GROUP A BY key;
> > C = FOREACH B {
> >    X = ORDER A BY orderingParam;
> >    Y = LIMIT X 100;
> >    GENERATE group, Y;}
> >
> > -Kris
> >
> > On Fri, Jun 29, 2012 at 04:19:18PM -0700, Benjamin Juhn wrote:
> > > Hi there,
> > >
> > > I'm trying to write a group by statement, only returning the top 100
> > records from each group.  Does pig support this?
> > >
> > > Thanks,
> > > Ben
> >
> > --
> > Kris Coward                                     http://unripe.melon.org/
> > GPG Fingerprint: 2BF3 957D 310A FEEC 4733  830E 21A4 05C7 1FEB 12B3
> >

-- 
Kris Coward					http://unripe.melon.org/
GPG Fingerprint: 2BF3 957D 310A FEEC 4733  830E 21A4 05C7 1FEB 12B3

Re: Group by Fetching top 100 from each group

Posted by Jonathan Coveney <jc...@gmail.com>.
Ideally, you should use the TOP function. It will be more efficient, as it
is algebraic.
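
For reference, here is a minimal sketch of the TOP approach, assuming the
TOP(topN, column, relation) form described in the Pig 0.10 built-in function
docs linked earlier in the thread. The relation names, the file name and the
schema are illustrative placeholders, not something from the original
question. The column argument is a zero-based field index, so 3 refers to
sales in this schema.

```pig
-- Hypothetical input: one tab-separated row per salesperson.
people = LOAD 'people.tsv' USING PigStorage() AS (
  fname:chararray,
  lname:chararray,
  state:chararray,
  sales:double
);

group_by_state = GROUP people BY state;

top_sales_by_state = FOREACH group_by_state {
  -- TOP(n, column_index, bag): keep the 100 tuples with the largest
  -- value in field 3 (sales); field indexes start at 0.
  top_sales = TOP(100, 3, people);
  GENERATE FLATTEN(top_sales);
};
```

The bag returned by TOP should not be relied on to be sorted, so add an
ORDER step afterwards if the output order matters.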

2012/6/29 Kris Coward <kr...@melon.org>

>
> LIMIT and ORDER BY are both allowed nested ops for a FOREACH statement.
> These should be able to do what you want.
>
> e.g.
>
> B = GROUP A BY key;
> C = FOREACH B {
>    X = ORDER A BY orderingParam;
>    Y = LIMIT X 100;
>    GENERATE group, Y;}
>
> -Kris
>
> On Fri, Jun 29, 2012 at 04:19:18PM -0700, Benjamin Juhn wrote:
> > Hi there,
> >
> > I'm trying to write a group by statement, only returning the top 100
> records from each group.  Does pig support this?
> >
> > Thanks,
> > Ben
>
> --
> Kris Coward                                     http://unripe.melon.org/
> GPG Fingerprint: 2BF3 957D 310A FEEC 4733  830E 21A4 05C7 1FEB 12B3
>

Re: Group by Fetching top 100 from each group

Posted by Kris Coward <kr...@melon.org>.
LIMIT and ORDER BY are both allowed nested ops for a FOREACH statement.
These should be able to do what you want.

e.g. 

B = GROUP A BY key;
C = FOREACH B {
    X = ORDER A BY orderingParam;
    Y = LIMIT X 100;
    GENERATE group, Y;
};

-Kris

On Fri, Jun 29, 2012 at 04:19:18PM -0700, Benjamin Juhn wrote:
> Hi there,
> 
> I'm trying to write a group by statement, only returning the top 100 records from each group.  Does pig support this?
> 
> Thanks,
> Ben

-- 
Kris Coward					http://unripe.melon.org/
GPG Fingerprint: 2BF3 957D 310A FEEC 4733  830E 21A4 05C7 1FEB 12B3

Re: Group by Fetching top 100 from each group

Posted by Sal Uryasev <su...@linkedin.com>.
Hey Ben,

You can do a nested ORDER => LIMIT inside a FOREACH
http://pig.apache.org/docs/r0.10.0/basic.html#foreach

Newer versions of Pig also have a TOP function that will replace the ORDER => LIMIT.

-Sal

On Jun 29, 2012, at 4:19 PM, Benjamin Juhn wrote:

> Hi there,
> 
> I'm trying to write a group by statement, only returning the top 100 records from each group.  Does pig support this?
> 
> Thanks,
> Ben


RE: Group by Fetching top 100 from each group

Posted by Austin Stickney <as...@whitepages.com>.
You would want to do a FOREACH after the GROUP BY where you limit the contents of each group. Usually you would also want to order the bag before you limit it, so that you are taking the top 100 of something, rather than just a random selection of 100. Here's an example that creates a list of the top 100 salesmen from each state.

people = LOAD 'people.tsv' USING PigStorage() AS (
  fname:chararray,
  lname:chararray,
  state:chararray,
  sales:double
);

group_by_state = GROUP people BY state;

top_sales_by_state = FOREACH group_by_state {
  order_by_sales = ORDER people BY sales DESC;
  top_sales = LIMIT order_by_sales 100;

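  -- Each tuple in top_sales already carries state, so the group key is
  -- not emitted again; a second field named state would clash with it.
  GENERATE
    FLATTEN(top_sales) AS (
      fname:chararray,
      lname:chararray,
      state:chararray,
      sales:double
    )
  ;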
};

Austin

-----Original Message-----
From: Benjamin Juhn [mailto:benjijuhn@gmail.com] 
Sent: Friday, June 29, 2012 4:19 PM
To: pig-user@hadoop.apache.org
Subject: Group by Fetching top 100 from each group

Hi there,

I'm trying to write a group by statement, only returning the top 100 records from each group.  Does pig support this?

Thanks,
Ben