Posted to user@hive.apache.org by Keith Wiley <kw...@keithwiley.com> on 2013/04/02 22:54:10 UTC

Rank(): here's how I did it, for better or worse...

I agree, it's probably best to use a better engineered approach such as Edward's.  In the meantime, if anyone would benefit from a walk-through of my direct approach, here it is.  It combines Ritesh's direct ultra-simplistic method with Edward's correct HiveQL syntax.  As would be expected, it is sensitive to shell vagaries that would be better managed by a combo system like git and maven...but it works.

========================================

Create Rank.java:
-----
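package com.example.hive.udf;
import org.apache.hadoop.hive.ql.exec.UDF;

// Stateful UDF: returns a 0-based counter that restarts whenever the key changes.
// It only yields a meaningful rank if rows arrive grouped by key and already sorted.
public final class Rank extends UDF{
    private int counter;
    private String last_key;
    public int evaluate(final String key){
      // A NULL key would otherwise throw a NullPointerException;
      // with this guard every NULL-key row simply gets rank 0.
      if ( key == null || !key.equalsIgnoreCase(this.last_key) ) {
         this.counter = 0;
         this.last_key = key;
      }
      // Post-increment: the first row of each key gets 0.
      return this.counter++;
    }
}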
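
For anyone who wants to see the counter logic in action without a Hive install, here is a minimal plain-Java sketch.  RankSketch and rankAll are hypothetical names made up for this illustration; the class only mirrors the reset-on-key-change idea of Rank above and, to stay short, assumes the keys are never null.  It shows why the rows must arrive grouped by key:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for the UDF, with no Hive dependency.
// Simplification: keys are assumed to be non-null.
final class RankSketch {
    private int counter;
    private String lastKey;

    // Restart at 0 when the key changes, otherwise keep counting.
    int evaluate(final String key) {
        if (!key.equalsIgnoreCase(lastKey)) {
            counter = 0;
            lastKey = key;
        }
        return counter++;
    }

    // Feed the keys to a fresh instance, in the given order, and collect the ranks.
    static List<Integer> rankAll(final List<String> keys) {
        final RankSketch udf = new RankSketch();
        final List<Integer> ranks = new ArrayList<>();
        for (final String key : keys) {
            ranks.add(udf.evaluate(key));
        }
        return ranks;
    }

    public static void main(final String[] args) {
        // Keys grouped together: prints [0, 1, 0, 1, 2]
        System.out.println(rankAll(Arrays.asList("user1", "user1", "user2", "user2", "user2")));
        // Keys interleaved: the counter resets on every row, prints [0, 0, 0, 0]
        System.out.println(rankAll(Arrays.asList("user1", "user2", "user1", "user2")));
    }
}
```

The second call is the failure mode to watch for: if the rows reaching the function are not grouped by key, every row looks like the start of a new group.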

========================================

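Compile Rank.java to Rank.class, then bundle it into Rank.jar.  Note that the jar command stores each entry under the relative path you give it, so it must be run from the directory that contains 'com' (RankTempDir here); otherwise the package path inside the resulting .jar file will not match the package declared in the .java file: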
-----
$ mkdir ./RankTempDir
$ javac -classpath $HIVE_HOME/lib/hive-serde-0.8.1.jar:$HIVE_HOME/lib/hive-exec-0.8.1.jar:$HADOOP_HOME/hadoop-core.jar -d ./RankTempDir Rank.java
$ cd RankTempDir
$ jar -cf ../Rank.jar ./com
$ cd ..

Verify (via echo) that HADOOP_HOME and HIVE_HOME are set, then check the names of the serde, exec, and core jars in each directory.  The exact filenames are version specific; the ones above are from Hive 0.8.1.

Verify the package path in Rank.jar:
-----
$ jar -tvf Rank.jar

You should see 'com/example/hive/udf/Rank.class'.  If you see a different path, the package has not been properly represented in the jar w.r.t. its designation in the .java file.
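
If you would rather check this programmatically than eyeball the listing, the following is a rough sketch using the JDK's java.util.jar.JarFile.  JarPathCheck, entryPathFor, and containsClass are hypothetical names invented for the example; the only assumption is the usual rule that a class's entry path is its fully qualified name with dots turned into slashes, plus ".class":

```java
import java.io.IOException;
import java.util.jar.JarFile;

// Hypothetical helper, not part of Hive or of the steps above.
final class JarPathCheck {
    // Map a fully qualified class name to the entry path the jar must contain.
    static String entryPathFor(final String className) {
        return className.replace('.', '/') + ".class";
    }

    // True if the jar at jarPath holds the class at the path its package implies.
    static boolean containsClass(final String jarPath, final String className) throws IOException {
        try (JarFile jar = new JarFile(jarPath)) {
            return jar.getJarEntry(entryPathFor(className)) != null;
        }
    }

    public static void main(final String[] args) throws IOException {
        final String className = "com.example.hive.udf.Rank";
        // Prints com/example/hive/udf/Rank.class
        System.out.println(entryPathFor(className));
        // Optionally pass the path to Rank.jar as the first argument.
        if (args.length > 0) {
            System.out.println(containsClass(args[0], className));
        }
    }
}
```

Run it with the path to Rank.jar as its argument; a result of false means the jar was built from the wrong directory.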

========================================

Run hive and prepare the session to use the UDF:
-----
$ hive
hive> add jar Rank.jar;
hive> create temporary function rank as 'com.example.hive.udf.Rank';

You must either run hive from a directory containing Rank.jar or specify an alternate path to it in the "add" command.  Note that the Rank class's full package is specified in the "create" command and therefore must, logically, match the package in both the .java and the .jar files.

========================================

Consider a table named 'test' consisting of columns 'user', 'category', and 'value', containing the following data:

hive> select * from test;
user1	catA	1
user1	catB	11
user1	catC	111
user2	catA	222
user2	catB	22
user2	catC	2
user3	catA	3
user3	catB	5
user3	catC	4

So the top category for user1 is catC, for user2 is catA and for user3 is catB.  Say we want the top N valued categories for each user.  In the example below, N is 2 (it is indicated in the final WHERE clause).  Here is the format of the corresponding ranked query, and its result:

hive> SELECT user, category, value, ranked_col
FROM (
    SELECT user, category, value, rank(user) ranked_col
    FROM (
        SELECT user, category, value
        FROM test
        DISTRIBUTE BY user
        SORT BY user, value DESC
    ) a
) b
WHERE ranked_col < 2
ORDER BY user, ranked_col;
...
[wait for Hive query and MapReduce job(s) to finish]
...
user1	catC	111	0
user1	catB	11	1
user2	catA	222	0
user2	catB	22	1
user3	catB	5	0
user3	catC	4	1
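
To make the roles of the three pieces (sort, rank, filter) explicit, here is a small plain-Java model of the same computation over the sample data.  It is only a sketch and involves no Hive: TopNSketch, Row, sampleRows, and topN are hypothetical names, the in-memory sort stands in for DISTRIBUTE BY / SORT BY, and the user comparison is case sensitive for simplicity.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical plain-Java model of the ranked query; no Hive involved.
final class TopNSketch {
    // One row of the 'test' table.
    static final class Row {
        final String user;
        final String category;
        final int value;
        Row(final String user, final String category, final int value) {
            this.user = user;
            this.category = category;
            this.value = value;
        }
    }

    // The nine sample rows of the 'test' table.
    static List<Row> sampleRows() {
        return Arrays.asList(
            new Row("user1", "catA", 1), new Row("user1", "catB", 11), new Row("user1", "catC", 111),
            new Row("user2", "catA", 222), new Row("user2", "catB", 22), new Row("user2", "catC", 2),
            new Row("user3", "catA", 3), new Row("user3", "catB", 5), new Row("user3", "catC", 4));
    }

    // Return the top n rows per user as tab-separated lines with a 0-based rank.
    static List<String> topN(final List<Row> rows, final int n) {
        // Stands in for SORT BY user, value desc.
        final List<Row> sorted = new ArrayList<>(rows);
        sorted.sort((a, b) -> {
            final int byUser = a.user.compareTo(b.user);
            return byUser != 0 ? byUser : Integer.compare(b.value, a.value);
        });

        final List<String> out = new ArrayList<>();
        String lastUser = null;
        int counter = 0;
        for (final Row r : sorted) {
            // Stands in for rank(user): restart the counter when the user changes.
            if (!r.user.equals(lastUser)) {
                counter = 0;
                lastUser = r.user;
            }
            final int rank = counter++;
            // Stands in for WHERE ranked_col < n.
            if (rank < n) {
                out.add(r.user + "\t" + r.category + "\t" + r.value + "\t" + rank);
            }
        }
        return out;
    }

    public static void main(final String[] args) {
        for (final String line : topN(sampleRows(), 2)) {
            System.out.println(line);
        }
    }
}
```

With n = 2 this prints the same six rows as the Hive result above.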

Note that ranks are 0-indexed.  That is a property of the specific Rank.java written above, so it is easy to switch to 1-indexing, which better matches the usual notion of a "rank".
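
As a rough illustration of that change, the sketch below swaps the post-increment for a pre-increment so that the first row of each key gets 1.  RankOneBasedSketch and rankAll are hypothetical names, the class has no Hive dependency, and it assumes non-null keys to stay short; in the real UDF the equivalent change would be to the return statement of evaluate.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical 1-indexed variant of the counter logic; no Hive dependency.
// Simplification: keys are assumed to be non-null.
final class RankOneBasedSketch {
    private int counter;
    private String lastKey;

    int evaluate(final String key) {
        if (!key.equalsIgnoreCase(lastKey)) {
            counter = 0;
            lastKey = key;
        }
        // Pre-increment: the first row of each key gets 1 instead of 0.
        return ++counter;
    }

    // Feed the keys to a fresh instance, in the given order, and collect the ranks.
    static List<Integer> rankAll(final List<String> keys) {
        final RankOneBasedSketch udf = new RankOneBasedSketch();
        final List<Integer> ranks = new ArrayList<>();
        for (final String key : keys) {
            ranks.add(udf.evaluate(key));
        }
        return ranks;
    }

    public static void main(final String[] args) {
        // Prints [1, 2, 1]
        System.out.println(rankAll(Arrays.asList("user1", "user1", "user2")));
    }
}
```

With 1-indexed ranks, the top-2 filter in the query would become "WHERE ranked_col <= 2" rather than "WHERE ranked_col < 2".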

Anyway, that's what I came up with.  I don't by any means claim it's the best approach.  Edward is surely right that the best method would be to use the powerful tools made available by the large developer community such as git and maven.

Cheers!

________________________________________________________________________________
Keith Wiley     kwiley@keithwiley.com     keithwiley.com    music.keithwiley.com

"And what if we picked the wrong religion?  Every week, we're just making God
madder and madder!"
                                           --  Homer Simpson
________________________________________________________________________________