Posted to user@pig.apache.org by Erik Onnen <eo...@gmail.com> on 2011/01/19 19:31:29 UTC

Identifying the top 20% of a sorted relation

I have a sorted relation that looks similar to the following, where each
user has a consumption count:

(user1, 1000)
(user99, 999)
(user2, 998)
(user3, 22)
(user4, 10)
...

I'd like to identify the top 20% of users based on the second field. I can
get the aggregate sum of the second field easily enough, but I can't get my
head around a mechanism for picking out only the users in the top 20%. As
best I can tell, I'd need something like an accumulator that increments for
each tuple and stops when it reaches 20% of the total sum, but that doesn't
seem possible.
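
To make the idea concrete, here is a rough sketch of the accumulator I have
in mind, written in plain Python rather than Pig. The function name
top_share, the share parameter, and the choice to include the tuple that
crosses the threshold are all assumptions for illustration; the rows are the
sample values from above.

```python
# Sample rows from the post, already sorted descending by count.
rows = [
    ("user1", 1000),
    ("user99", 999),
    ("user2", 998),
    ("user3", 22),
    ("user4", 10),
]


def top_share(rows, share=0.20):
    """Return the leading rows whose running total first reaches
    `share` of the overall sum.

    Assumes `rows` is sorted descending by count. The row that crosses
    the threshold is included (an assumption, not a requirement).
    """
    total = sum(count for _, count in rows)
    threshold = share * total

    selected = []
    running = 0
    for user, count in rows:
        # Stop once the rows already taken cover the requested share.
        if running >= threshold:
            break
        selected.append((user, count))
        running += count
    return selected


print(top_share(rows))
```

With the sample data the total is 3029, so 20% is 605.8 and the first row
alone already covers it. What I can't see is how to express that running
total and early stop in Pig itself.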

Has anybody done anything similar from within Pig?