Posted to user@hive.apache.org by Saurabh Nanda <sa...@gmail.com> on 2009/07/20 07:24:05 UTC

dense_rank() equivalent in Hive?

Hi,

Does Hive have any function equivalent to Oracle's dense_rank (
http://download.oracle.com/docs/cd/B14117_01/server.101/b10759/functions038.htm)
function? Here's what I'm trying to do:

For each user, I want to sort the page views in ascending order of time, and
then number each page view with its rank. This will enable me to easily
identify a particular user's first page view, second page view, and so on.

Is there any other way to approach this problem? If I can ensure that a
particular user's (sorted) data is guaranteed to be processed on a single
Hadoop node, then probably I can write a custom script to do the ranking for
me.
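
A minimal sketch of what such a custom ranking script might look like, in
Python. It assumes each input line holds tab-separated user_id, view_time
and url fields (a hypothetical layout, not taken from this thread), and that
all rows for one user arrive together, already sorted by view_time. Like
dense_rank, it gives tied view times the same rank and leaves no gaps.

```python
import sys


def dense_rank(lines):
    """Yield each input line with a per-user dense rank appended.

    Assumed (hypothetical) layout: user_id <TAB> view_time <TAB> url, with
    each user's rows contiguous and sorted by view_time ascending.
    """
    prev_user = None
    prev_time = None
    rank = 0
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        user, view_time = line.split("\t")[:2]
        if user != prev_user:
            rank = 1      # first page view of a new user
        elif view_time != prev_time:
            rank += 1     # later view time: next rank, no gaps
        # same user and same view_time: a tie keeps the current rank
        prev_user, prev_time = user, view_time
        yield f"{line}\t{rank}"


def main(stdin=sys.stdin, stdout=sys.stdout):
    """Read rows from stdin and write ranked rows to stdout."""
    for ranked in dense_rank(stdin):
        stdout.write(ranked + "\n")
```

To use this as a streaming script, call main() at the end of the file. The
ranks are only correct if every row for a given user reaches the same
process in sorted order, which is exactly the guarantee asked about above.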

Saurabh.
-- 
http://nandz.blogspot.com
http://foodieforlife.blogspot.com

RE: dense_rank() equivalent in Hive?

Posted by Ashish Thusoo <at...@facebook.com>.

yes.

________________________________
From: Saurabh Nanda [mailto:saurabhnanda@gmail.com]
Sent: Sunday, July 19, 2009 11:38 PM
To: hive-user@hadoop.apache.org
Subject: Re: dense_rank() equivalent in Hive?

Is there any other way to approach this problem? If I can ensure that a particular user's (sorted) data is guaranteed to be processed on a single Hadoop node, then probably I can write a custom script to do the ranking for me.

I guess the answer to my query is given at http://wiki.apache.org/hadoop/Hive/LanguageManual/SortBy --

"Hive uses the columns in Distribute By to distribute the rows among reducers. All rows with the same Distribute By columns will go to the same reducer. Instead of specifying Cluster By, the user can specify Distribute By and Sort By, so the partition columns and sort columns can be different. The usual case is that the partition columns are a prefix of sort columns, but that is not required."

Saurabh.

--

http://nandz.blogspot.com
http://foodieforlife.blogspot.com

Re: dense_rank() equivalent in Hive?

Posted by Saurabh Nanda <sa...@gmail.com>.

> Is there any other way to approach this problem? If I can ensure that a
> particular user's (sorted) data is guaranteed to be processed on a single
> Hadoop node, then probably I can write a custom script to do the ranking for
> me.


I guess the answer to my query is given at
http://wiki.apache.org/hadoop/Hive/LanguageManual/SortBy --

"Hive uses the columns in *Distribute By* to distribute the rows among
reducers. All rows with the same *Distribute By* columns will go to the same
reducer. Instead of specifying *Cluster By*, the user can specify *Distribute
By* and *Sort By*, so the partition columns and sort columns can be
different. The usual case is that the partition columns are a prefix of sort
columns, but that is not required."
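
The behaviour described in that passage can be modelled in a few lines of
Python. This is only an illustrative sketch, not Hive's implementation: the
function name, the CRC-based partitioning and the sample rows are all
assumptions made for the example. It shows the usual case where the
partition column (user_id) is a prefix of the sort columns (user_id,
view_time).

```python
import zlib
from collections import defaultdict


def distribute_and_sort(rows, distribute_cols, sort_cols, num_reducers):
    """Model Distribute By + Sort By: partition rows, then sort each partition.

    Hypothetical helper; the hash function is a stand-in for whatever
    partitioner the real system uses.
    """
    partitions = defaultdict(list)
    for row in rows:
        key = tuple(row[c] for c in distribute_cols)
        # Equal Distribute By values always map to the same reducer.
        reducer = zlib.crc32(repr(key).encode("utf-8")) % num_reducers
        partitions[reducer].append(row)
    # Sort By orders rows within each reducer only, not globally.
    return {
        reducer: sorted(part, key=lambda row: tuple(row[c] for c in sort_cols))
        for reducer, part in partitions.items()
    }


# Made-up sample data for illustration.
page_views = [
    {"user_id": "u2", "view_time": 5, "url": "/x"},
    {"user_id": "u1", "view_time": 9, "url": "/b"},
    {"user_id": "u1", "view_time": 3, "url": "/a"},
    {"user_id": "u2", "view_time": 1, "url": "/y"},
]
by_reducer = distribute_and_sort(
    page_views, ["user_id"], ["user_id", "view_time"], num_reducers=3
)
```

Each reducer then sees every row for the users assigned to it, in time
order, so a per-user counter is enough to number the page views.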

Saurabh.

-- 
http://nandz.blogspot.com
http://foodieforlife.blogspot.com