Posted to dev@pig.apache.org by Gianmarco <gi...@gmail.com> on 2011/03/22 19:31:50 UTC

RANK implementation in Pig

Hi all,

I was thinking about a nice project for Pig for this year's GSoC.
One idea I had was to implement something similar to the SQL RANK() function.
RANK() returns the rank of each row within the partition of a result set.
The rank of a row is one plus the number of ranks that come before the row
in question.
Basically, when there are no ties, it assigns a consecutive unique identifier
to each row (a row id).

In my experience this is a very useful feature.
Of course, the naive solution would be to use 1 reducer and stamp each tuple
in a bag with an increasing id.
But there is an algorithm that does this in parallel, using 2 MR jobs.
The idea would be to add a new operator (as this cannot be done with a UDF)
that can rank bags/relations.
Of course this could be used in conjunction with ORDER BY to define the
specific rank order.
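
The message does not spell out the two-job algorithm, so the following is
only a sketch of one plausible reading: a first pass counts the rows in each
already-ordered partition, and a second pass numbers the rows of each
partition independently, starting from the sum of the counts of the
partitions before it. Plain Python lists stand in for the MapReduce
partitions, and all function names here are hypothetical, not part of Pig.

```python
from itertools import accumulate


def count_per_partition(partitions):
    """Pass 1 (assumed): each partition reports how many rows it holds."""
    return [len(part) for part in partitions]


def starting_offsets(counts):
    """Exclusive prefix sum: number of rows that precede each partition."""
    return [0] + list(accumulate(counts))[:-1]


def assign_row_ids(partitions, offsets):
    """Pass 2 (assumed): each partition numbers its own rows from its offset.

    Partitions do not depend on each other here, so in a real job this
    step could run in parallel.
    """
    return [
        [(offset + i + 1, row) for i, row in enumerate(part)]
        for part, offset in zip(partitions, offsets)
    ]


def rank_partitions(partitions):
    """Run both passes over partitions that are already globally ordered."""
    counts = count_per_partition(partitions)
    offsets = starting_offsets(counts)
    return assign_row_ids(partitions, offsets)


if __name__ == "__main__":
    # Three partitions, as if produced by a preceding ORDER BY.
    for part in rank_partitions([["a", "b"], ["c"], ["d", "e", "f"]]):
        print(part)
```

Only the per-partition counts have to be gathered in one place, which is
what would let this avoid the single-reducer bottleneck. The sketch produces
consecutive row ids and does not handle ties.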

Do you see this as an interesting project?

Thanks,
--
Gianmarco De Francisci Morales