Posted to user@pig.apache.org by Neil Kodner <nk...@gmail.com> on 2010/08/22 18:26:14 UTC
grouped top-n query in pig
I'm trying to perform a top-n query in Pig. For example's sake, let's say my
input data is
(employeeid, departmentid, salary).
I'm trying to get the n highest-salaried employees of each department.
I would start off by grouping the data by department, but I'm not sure how to
sort and limit the grouped data before flattening it into my output rows.
Re: grouped top-n query in pig
Posted by David Vrensk <da...@icehouse.se>.
On Sun, Aug 22, 2010 at 18:26, Neil Kodner <nk...@gmail.com> wrote:
> I'm trying to perform a top-n query in Pig. For example's sake, let's say
> my input data is
> (employeeid, departmentid, salary).
>
> I'm trying to get the n highest-salaried employees of each department.
>
> I would start off by grouping the data by department, but I'm not sure how to
> sort and limit the grouped data before flattening it into my output rows.
>
It's not immediately obvious what to do, but we came up with something that
solves the problem:
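    grouped = GROUP the_input BY (departmentid);
    top_n =
      FOREACH grouped {
        -- sort each department's bag by salary, highest first
        ordered = ORDER the_input BY salary DESC;
        -- keep the first n tuples (n = 20 here)
        limited = LIMIT ordered 20;
        -- emit one output row per kept tuple
        GENERATE FLATTEN(limited);
      }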
If this is ridiculously slow compared to another solution, please let me
know.
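To sanity-check the output on a small sample, the same group / order / limit / flatten logic can be sketched in plain Python. This is only a reference for what the query should return, not Pig code; the function name and the sample rows are made up for illustration, and departments are sorted here only to make the output deterministic (Pig makes no such promise about group order).

```python
from collections import defaultdict
from operator import itemgetter


def top_n_per_department(rows, n):
    """rows: iterable of (employeeid, departmentid, salary) tuples."""
    # GROUP the_input BY departmentid
    grouped = defaultdict(list)
    for row in rows:
        grouped[row[1]].append(row)

    result = []
    for departmentid in sorted(grouped):
        # ORDER the_input BY salary DESC
        ordered = sorted(grouped[departmentid], key=itemgetter(2), reverse=True)
        # LIMIT ordered n, then FLATTEN: one output row per kept tuple
        result.extend(ordered[:n])
    return result


# Made-up sample data: (employeeid, departmentid, salary)
sample = [
    (1, 10, 50000),
    (2, 10, 70000),
    (3, 10, 60000),
    (4, 20, 40000),
    (5, 20, 45000),
]
top_two = top_n_per_department(sample, 2)
```

With n = 2, the sample should yield employees 2 and 3 for department 10 and employees 5 and 4 for department 20. How ties on salary are broken at the cut-off may differ between this sketch and Pig.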
HTH,
/David
--
David Vrensk
Systems developer, ICE House AB
Mobile: +46 703 74 69 00