Posted to user@pig.apache.org by Neil Kodner <nk...@gmail.com> on 2010/08/22 18:26:14 UTC

grouped top-n query in pig

I'm trying to perform a top-n query in Pig.  For example's sake, let's say my
input data is
(employeeid, departmentid, salary).

I'm trying to get the top n-highest-salaried employees of each department.

I would start off by grouping the data by department but am not sure how to
sort and limit the grouped data before flattening it into my output rows.

Re: grouped top-n query in pig

Posted by David Vrensk <da...@icehouse.se>.
On Sun, Aug 22, 2010 at 18:26, Neil Kodner <nk...@gmail.com> wrote:

> I'm trying to perform a top-n query in Pig.  For example's sake, let's say
> my input data is
> (employeeid, departmentid, salary).
>
> I'm trying to get the top n-highest-salaried employees of each department.
>
> I would start off by grouping the data by department but am not sure how to
> sort and limit the grouped data before flattening it into my output rows.
>

It's not immediately obvious what to do, but we came up with something that
solves the problem:

grouped = GROUP the_input BY (departmentid);

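top_n =
  FOREACH grouped {
    -- sort each department's bag by salary, highest first
    ordered = ORDER the_input BY salary DESC;
    -- keep the first 20 rows of the sorted bag (n = 20 here)
    limited = LIMIT ordered 20;
    GENERATE FLATTEN(limited);
  };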
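
For completeness, here is a rough end-to-end sketch of the same approach.
The file names, the tab delimiter, the field types and n = 3 are all
assumptions made for illustration; adjust them to your data.

-- Hypothetical input: tab-delimited rows of (employeeid, departmentid, salary)
the_input = LOAD 'employees.tsv' USING PigStorage('\t')
            AS (employeeid:int, departmentid:int, salary:double);

-- One group (and one bag of rows) per department
grouped = GROUP the_input BY departmentid;

top_n =
  FOREACH grouped {
    ordered = ORDER the_input BY salary DESC;
    limited = LIMIT ordered 3;
    -- FLATTEN turns the bag back into rows of
    -- (employeeid, departmentid, salary)
    GENERATE FLATTEN(limited);
  };

-- Hypothetical output directory
STORE top_n INTO 'top_salaries' USING PigStorage('\t');

Departments with fewer than n employees simply contribute all of their rows.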

If this is ridiculously slow compared to another solution, please let me
know.

HTH,

/David

-- 
David Vrensk
Systems developer, ICE House AB
Mobile: +46 703 74 69 00