You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@hive.apache.org by Keith Wiley <kw...@keithwiley.com> on 2013/04/02 00:45:42 UTC

Need rank(), can't build m6d's version

I need rank() in Hive.  I have't had much luck with Edward Capriolo's on git and it comes with no documentation.  It depends on hive-test (also by Edward) and I can't get maven to build the tests for hive-test which prevents me from building the hive-test package (the jar)...which is all pretty tangential to the real goal, which is rank of course.

Is there any place to download the hive-rank jar pre-built instead of having to build it out of git?  If not, has anyone built it recently, or namely has anyone built the hive-test project it depends on?  The following is just one example of the sort of error I see.  Remember this error is on hive-test, which frankly I don't even care about.  :-)

Feeling a little desperate...thanks for any help.

-------------------------------------------------------------------------------
Test set: com.jointhegrid.hive_test.EmbeddedHiveExampleTest
-------------------------------------------------------------------------------
Tests run: 1, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 8.878 sec <<< FAILURE!
testA(com.jointhegrid.hive_test.EmbeddedHiveExampleTest)  Time elapsed: 7.823 sec  <<< FAILURE!
java.lang.AssertionError: expected:<0> but was:<9>
	at org.junit.Assert.fail(Assert.java:91)
	at org.junit.Assert.failNotEquals(Assert.java:645)
	at org.junit.Assert.assertEquals(Assert.java:126)
	at org.junit.Assert.assertEquals(Assert.java:470)
	at org.junit.Assert.assertEquals(Assert.java:454)
	at com.jointhegrid.hive_test.EmbeddedHiveExampleTest.testA(EmbeddedHiveExampleTest.java:51)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
	at java.lang.reflect.Method.invoke(Method.java:597)
	at junit.framework.TestCase.runTest(TestCase.java:168)
	at junit.framework.TestCase.runBare(TestCase.java:134)
	at junit.framework.TestResult$1.protect(TestResult.java:110)
	at junit.framework.TestResult.runProtected(TestResult.java:128)
	at junit.framework.TestResult.run(TestResult.java:113)
	at junit.framework.TestCase.run(TestCase.java:124)
	at junit.framework.TestSuite.runTest(TestSuite.java:243)
	at junit.framework.TestSuite.run(TestSuite.java:238)
	at org.junit.internal.runners.JUnit38ClassRunner.run(JUnit38ClassRunner.java:83)
	at org.apache.maven.surefire.junit4.JUnit4TestSet.execute(JUnit4TestSet.java:35)
	at org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:115)
	at org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:97)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
	at java.lang.reflect.Method.invoke(Method.java:597)
	at org.apache.maven.surefire.booter.ProviderFactory$ClassLoaderProxy.invoke(ProviderFactory.java:103)
	at $Proxy0.invoke(Unknown Source)
	at org.apache.maven.surefire.booter.SurefireStarter.invokeProvider(SurefireStarter.java:150)
	at org.apache.maven.surefire.booter.SurefireStarter.runSuitesInProcess(SurefireStarter.java:91)
	at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:69)

________________________________________________________________________________
Keith Wiley     kwiley@keithwiley.com     keithwiley.com    music.keithwiley.com

"I do not feel obliged to believe that the same God who has endowed us with
sense, reason, and intellect has intended us to forgo their use."
                                           --  Galileo Galilei
________________________________________________________________________________


Re: Need rank()

Posted by Keith Wiley <kw...@keithwiley.com>.
I'm pretty sure there are at least a few bugs in that article.  The problem described below goes away if "rank" is added to the outer select statement.  If there's one error I suppose there may be others.  Although I have gotten the query to run, I'm not getting properly ranked results yet.  I'm still working on it.

On Apr 2, 2013, at 10:03 , Keith Wiley wrote:

> On Apr 1, 2013, at 16:12 , Alexander Pivovarov wrote:
> 
>> http://ragrawal.wordpress.com/2011/11/18/extract-top-n-records-in-each-group-in-hadoophive/
> 
> Is there any possibility there is a bug in Ritesh Agrawal's query statement from that article?  I created a test table with the exact column names from the example in the article and used a minimally altered version of the command (I removed the where clause to simplify things a bit) and got an error which suggests there is something slightly wrong with the command (or perhaps the table has to be configured a special way).  Here's what I get when I almost perfectly duplicate that example:
> 
> hive> describe test;
> OK
> user	string	
> category	string	
> value	int	
> Time taken: 0.082 seconds
> ==================================================
> hive> select * from test;
> OK
> user1	cat1	1
> user1	cat1	2
> user1	cat1	3
> user1	cat2	10
> user1	cat2	20
> user1	cat2	30
> user2	cat1	11
> user2	cat1	21
> user2	cat1	31
> user2	cat2	5
> user2	cat2	6
> user2	cat2	7
> Time taken: 0.202 seconds
> ==================================================
> hive> SELECT user, category, value
>> FROM (
>> SELECT user, category, rank(user) as rank, value
>> FROM test
>> DISTRIBUTE BY user
>> SORT BY user, value desc
>> ) a
>> WHERE rank > 1
>> ORDER BY user, rank;
> FAILED: Error in semantic analysis: Line 9:15 Invalid table alias or column reference 'rank': (possible column names are: _col0, _col1, _col2)
> hive> 
> 
> ________________________________________________________________________________
> Keith Wiley     kwiley@keithwiley.com     keithwiley.com    music.keithwiley.com
> 
> "What I primarily learned in grad school is how much I *don't* know.
> Consequently, I left grad school with a higher ignorance to knowledge ratio than
> when I entered."
>                                           --  Keith Wiley
> ________________________________________________________________________________
> 


________________________________________________________________________________
Keith Wiley     kwiley@keithwiley.com     keithwiley.com    music.keithwiley.com

"Yet mark his perfect self-contentment, and hence learn his lesson, that to be
self-contented is to be vile and ignorant, and that to aspire is better than to
be blindly and impotently happy."
                                           --  Edwin A. Abbott, Flatland
________________________________________________________________________________


Re: Need rank()

Posted by Keith Wiley <kw...@keithwiley.com>.
I suppose.  Like I said, I was just following the example on that website and it didn't seem to have any comments on it remarking on errors in the article.  Thus my confusion about the whole thing.  It is presented as being accurate and is old enough that any errors would have been caught and labeled in the comments long ago.  I also thought it was a little odd to name the function the same thing as the alias assigned to the result.  I'll try disambiguating that.

On Apr 2, 2013, at 10:22 , Nitin Pawar wrote:

> I am not sure about this 
> but you defined a rank function and then aliased the column as rank itself. Can this cause the issue? 
> 
> From the error it looks like it could not figure out which column is rank in the where clause "WHERE rank > 1" 
> 
> from Edward's blog the correct query looks similar  but his function is named as p_rank 
> 
> SELECT
>  category,country,product,sales,rank
> FROM (
>  SELECT
>    category,country,product,sales,
>    p_rank(category, country) rank
>  FROM (
>    SELECT
>      category,country,product,
>      sales
>    FROM p_rank_demo
>    DISTRIBUTE BY
>      category,country
>    SORT BY
>      category,country,sales desc) t1) t2
> WHERE rank <= 3
> 
> 
> On Tue, Apr 2, 2013 at 10:45 PM, Keith Wiley <kw...@keithwiley.com> wrote:
> I did all that, I just didn't paste it into the email.  That isn't the problem.  Sorry for the confusion.
> 
> hive> add jar Rank.jar;
> hive> create temporary function rank as 'com.example.hive.udf.Rank';
> 
> BTW, small typo, the condition at the end uses less-then, not greater-then...obviously...but that isn't the problem either.
> 
> On Apr 2, 2013, at 10:06 , Nitin Pawar wrote:
> 
> > I dont see you added rank in the classpath and defined rank function
> >
> > can you try add jar and define the function and try again
> >
> >
> >
> > On Tue, Apr 2, 2013 at 10:33 PM, Keith Wiley <kw...@keithwiley.com> wrote:
> > On Apr 1, 2013, at 16:12 , Alexander Pivovarov wrote:
> >
> > > http://ragrawal.wordpress.com/2011/11/18/extract-top-n-records-in-each-group-in-hadoophive/
> >
> > Is there any possibility there is a bug in Ritesh Agrawal's query statement from that article?  I created a test table with the exact column names from the example in the article and used a minimally altered version of the command (I removed the where clause to simplify things a bit) and got an error which suggests there is something slightly wrong with the command (or perhaps the table has to be configured a special way).  Here's what I get when I almost perfectly duplicate that example:
> >
> > hive> describe test;
> > OK
> > user    string
> > category        string
> > value   int
> > Time taken: 0.082 seconds
> > ==================================================
> > hive> select * from test;
> > OK
> > user1   cat1    1
> > user1   cat1    2
> > user1   cat1    3
> > user1   cat2    10
> > user1   cat2    20
> > user1   cat2    30
> > user2   cat1    11
> > user2   cat1    21
> > user2   cat1    31
> > user2   cat2    5
> > user2   cat2    6
> > user2   cat2    7
> > Time taken: 0.202 seconds
> > ==================================================
> > hive> SELECT user, category, value
> >     > FROM (
> >     > SELECT user, category, rank(user) as rank, value
> >     > FROM test
> >     > DISTRIBUTE BY user
> >     > SORT BY user, value desc
> >     > ) a
> >     > WHERE rank > 1
> >     > ORDER BY user, rank;
> > FAILED: Error in semantic analysis: Line 9:15 Invalid table alias or column reference 'rank': (possible column names are: _col0, _col1, _col2)
> > hive>
> >
> > ________________________________________________________________________________
> > Keith Wiley     kwiley@keithwiley.com     keithwiley.com    music.keithwiley.com
> >
> > "What I primarily learned in grad school is how much I *don't* know.
> > Consequently, I left grad school with a higher ignorance to knowledge ratio than
> > when I entered."
> >                                            --  Keith Wiley
> > ________________________________________________________________________________
> >
> >
> >
> >
> > --
> > Nitin Pawar
> 
> 
> ________________________________________________________________________________
> Keith Wiley     kwiley@keithwiley.com     keithwiley.com    music.keithwiley.com
> 
> "It's a fine line between meticulous and obsessive-compulsive and a slippery
> rope between obsessive-compulsive and debilitatingly slow."
>                                            --  Keith Wiley
> ________________________________________________________________________________
> 
> 
> 
> 
> -- 
> Nitin Pawar


________________________________________________________________________________
Keith Wiley     kwiley@keithwiley.com     keithwiley.com    music.keithwiley.com

"You can scratch an itch, but you can't itch a scratch. Furthermore, an itch can
itch but a scratch can't scratch. Finally, a scratch can itch, but an itch can't
scratch. All together this implies: He scratched the itch from the scratch that
itched but would never itch the scratch from the itch that scratched."
                                           --  Keith Wiley
________________________________________________________________________________


Re: Need rank()

Posted by Nitin Pawar <ni...@gmail.com>.
I am not sure about this
but you defined a rank function and then aliased the column as rank itself.
Can this cause the issue?

>From the error it looks like it could not figure out which column is rank
in the where clause "WHERE rank > 1"

from Edward's blog the correct query looks similar  but his function is
named as p_rank

SELECT
 category,country,product,sales,rank
FROM (
 SELECT
   category,country,product,sales,
   p_rank(category, country) rank
 FROM (
   SELECT
     category,country,product,
     sales
   FROM p_rank_demo
   DISTRIBUTE BY
     category,country
   SORT BY
     category,country,sales desc) t1) t2
WHERE rank <= 3


On Tue, Apr 2, 2013 at 10:45 PM, Keith Wiley <kw...@keithwiley.com> wrote:

> I did all that, I just didn't paste it into the email.  That isn't the
> problem.  Sorry for the confusion.
>
> hive> add jar Rank.jar;
> hive> create temporary function rank as 'com.example.hive.udf.Rank';
>
> BTW, small typo, the condition at the end uses less-then, not
> greater-then...obviously...but that isn't the problem either.
>
> On Apr 2, 2013, at 10:06 , Nitin Pawar wrote:
>
> > I dont see you added rank in the classpath and defined rank function
> >
> > can you try add jar and define the function and try again
> >
> >
> >
> > On Tue, Apr 2, 2013 at 10:33 PM, Keith Wiley <kw...@keithwiley.com>
> wrote:
> > On Apr 1, 2013, at 16:12 , Alexander Pivovarov wrote:
> >
> > >
> http://ragrawal.wordpress.com/2011/11/18/extract-top-n-records-in-each-group-in-hadoophive/
> >
> > Is there any possibility there is a bug in Ritesh Agrawal's query
> statement from that article?  I created a test table with the exact column
> names from the example in the article and used a minimally altered version
> of the command (I removed the where clause to simplify things a bit) and
> got an error which suggests there is something slightly wrong with the
> command (or perhaps the table has to be configured a special way).  Here's
> what I get when I almost perfectly duplicate that example:
> >
> > hive> describe test;
> > OK
> > user    string
> > category        string
> > value   int
> > Time taken: 0.082 seconds
> > ==================================================
> > hive> select * from test;
> > OK
> > user1   cat1    1
> > user1   cat1    2
> > user1   cat1    3
> > user1   cat2    10
> > user1   cat2    20
> > user1   cat2    30
> > user2   cat1    11
> > user2   cat1    21
> > user2   cat1    31
> > user2   cat2    5
> > user2   cat2    6
> > user2   cat2    7
> > Time taken: 0.202 seconds
> > ==================================================
> > hive> SELECT user, category, value
> >     > FROM (
> >     > SELECT user, category, rank(user) as rank, value
> >     > FROM test
> >     > DISTRIBUTE BY user
> >     > SORT BY user, value desc
> >     > ) a
> >     > WHERE rank > 1
> >     > ORDER BY user, rank;
> > FAILED: Error in semantic analysis: Line 9:15 Invalid table alias or
> column reference 'rank': (possible column names are: _col0, _col1, _col2)
> > hive>
> >
> >
> ________________________________________________________________________________
> > Keith Wiley     kwiley@keithwiley.com     keithwiley.com
> music.keithwiley.com
> >
> > "What I primarily learned in grad school is how much I *don't* know.
> > Consequently, I left grad school with a higher ignorance to knowledge
> ratio than
> > when I entered."
> >                                            --  Keith Wiley
> >
> ________________________________________________________________________________
> >
> >
> >
> >
> > --
> > Nitin Pawar
>
>
>
> ________________________________________________________________________________
> Keith Wiley     kwiley@keithwiley.com     keithwiley.com
> music.keithwiley.com
>
> "It's a fine line between meticulous and obsessive-compulsive and a
> slippery
> rope between obsessive-compulsive and debilitatingly slow."
>                                            --  Keith Wiley
>
> ________________________________________________________________________________
>
>


-- 
Nitin Pawar

Re: Need rank()

Posted by Keith Wiley <kw...@keithwiley.com>.
I did all that, I just didn't paste it into the email.  That isn't the problem.  Sorry for the confusion.

hive> add jar Rank.jar;
hive> create temporary function rank as 'com.example.hive.udf.Rank';

BTW, small typo, the condition at the end uses less-then, not greater-then...obviously...but that isn't the problem either.

On Apr 2, 2013, at 10:06 , Nitin Pawar wrote:

> I dont see you added rank in the classpath and defined rank function 
> 
> can you try add jar and define the function and try again 
> 
> 
> 
> On Tue, Apr 2, 2013 at 10:33 PM, Keith Wiley <kw...@keithwiley.com> wrote:
> On Apr 1, 2013, at 16:12 , Alexander Pivovarov wrote:
> 
> > http://ragrawal.wordpress.com/2011/11/18/extract-top-n-records-in-each-group-in-hadoophive/
> 
> Is there any possibility there is a bug in Ritesh Agrawal's query statement from that article?  I created a test table with the exact column names from the example in the article and used a minimally altered version of the command (I removed the where clause to simplify things a bit) and got an error which suggests there is something slightly wrong with the command (or perhaps the table has to be configured a special way).  Here's what I get when I almost perfectly duplicate that example:
> 
> hive> describe test;
> OK
> user    string
> category        string
> value   int
> Time taken: 0.082 seconds
> ==================================================
> hive> select * from test;
> OK
> user1   cat1    1
> user1   cat1    2
> user1   cat1    3
> user1   cat2    10
> user1   cat2    20
> user1   cat2    30
> user2   cat1    11
> user2   cat1    21
> user2   cat1    31
> user2   cat2    5
> user2   cat2    6
> user2   cat2    7
> Time taken: 0.202 seconds
> ==================================================
> hive> SELECT user, category, value
>     > FROM (
>     > SELECT user, category, rank(user) as rank, value
>     > FROM test
>     > DISTRIBUTE BY user
>     > SORT BY user, value desc
>     > ) a
>     > WHERE rank > 1
>     > ORDER BY user, rank;
> FAILED: Error in semantic analysis: Line 9:15 Invalid table alias or column reference 'rank': (possible column names are: _col0, _col1, _col2)
> hive>
> 
> ________________________________________________________________________________
> Keith Wiley     kwiley@keithwiley.com     keithwiley.com    music.keithwiley.com
> 
> "What I primarily learned in grad school is how much I *don't* know.
> Consequently, I left grad school with a higher ignorance to knowledge ratio than
> when I entered."
>                                            --  Keith Wiley
> ________________________________________________________________________________
> 
> 
> 
> 
> -- 
> Nitin Pawar


________________________________________________________________________________
Keith Wiley     kwiley@keithwiley.com     keithwiley.com    music.keithwiley.com

"It's a fine line between meticulous and obsessive-compulsive and a slippery
rope between obsessive-compulsive and debilitatingly slow."
                                           --  Keith Wiley
________________________________________________________________________________


Re: Need rank()

Posted by Nitin Pawar <ni...@gmail.com>.
I dont see you added rank in the classpath and defined rank function

can you try add jar and define the function and try again



On Tue, Apr 2, 2013 at 10:33 PM, Keith Wiley <kw...@keithwiley.com> wrote:

> On Apr 1, 2013, at 16:12 , Alexander Pivovarov wrote:
>
> >
> http://ragrawal.wordpress.com/2011/11/18/extract-top-n-records-in-each-group-in-hadoophive/
>
> Is there any possibility there is a bug in Ritesh Agrawal's query
> statement from that article?  I created a test table with the exact column
> names from the example in the article and used a minimally altered version
> of the command (I removed the where clause to simplify things a bit) and
> got an error which suggests there is something slightly wrong with the
> command (or perhaps the table has to be configured a special way).  Here's
> what I get when I almost perfectly duplicate that example:
>
> hive> describe test;
> OK
> user    string
> category        string
> value   int
> Time taken: 0.082 seconds
> ==================================================
> hive> select * from test;
> OK
> user1   cat1    1
> user1   cat1    2
> user1   cat1    3
> user1   cat2    10
> user1   cat2    20
> user1   cat2    30
> user2   cat1    11
> user2   cat1    21
> user2   cat1    31
> user2   cat2    5
> user2   cat2    6
> user2   cat2    7
> Time taken: 0.202 seconds
> ==================================================
> hive> SELECT user, category, value
>     > FROM (
>     > SELECT user, category, rank(user) as rank, value
>     > FROM test
>     > DISTRIBUTE BY user
>     > SORT BY user, value desc
>     > ) a
>     > WHERE rank > 1
>     > ORDER BY user, rank;
> FAILED: Error in semantic analysis: Line 9:15 Invalid table alias or
> column reference 'rank': (possible column names are: _col0, _col1, _col2)
> hive>
>
>
> ________________________________________________________________________________
> Keith Wiley     kwiley@keithwiley.com     keithwiley.com
> music.keithwiley.com
>
> "What I primarily learned in grad school is how much I *don't* know.
> Consequently, I left grad school with a higher ignorance to knowledge
> ratio than
> when I entered."
>                                            --  Keith Wiley
>
> ________________________________________________________________________________
>
>


-- 
Nitin Pawar

Re: Need rank()

Posted by Keith Wiley <kw...@keithwiley.com>.
On Apr 2, 2013, at 12:59 , Edward Capriolo wrote:

> I feel your pain I do not have time to look at the project ATM. However trying to do Java development these days without understanding git, or maven is fairly impossible, most projects depend on N others that depend on M others. So trying to even compile the code outside of maven requires a lot of manually fetching jar etc. Your better off taking a day and getting good with maven, IMHO. You will be dealing with it for the foreseeable future. 


I agree.  I'll cozy up with it first chance I get...but sometimes you're on a deadline and have to go for the low-hanging fruit.  I fully understand the motivation behind your approach.  I'll circle back around it when I get some free time.

Cheers!

________________________________________________________________________________
Keith Wiley     kwiley@keithwiley.com     keithwiley.com    music.keithwiley.com

"I do not feel obliged to believe that the same God who has endowed us with
sense, reason, and intellect has intended us to forgo their use."
                                           --  Galileo Galilei
________________________________________________________________________________


Re: Need rank()

Posted by Edward Capriolo <ed...@gmail.com>.
I feel your pain I do not have time to look at the project ATM. However
trying to do Java development these days without understanding git, or
maven is fairly impossible, most projects depend on N others that depend on
M others. So trying to even compile the code outside of maven requires a
lot of manually fetching jar etc. Your better off taking a day and getting
good with maven, IMHO. You will be dealing with it for the foreseeable
future.

Re: Need rank()

Posted by Edward Capriolo <ed...@gmail.com>.
git clone https://github.com/edwardcapriolo/hive_test.git
cd hive_test/
mvn -Dmaven.test.skip=true install
cd ..
[edward@jackintosh java]$ git clone
https://github.com/edwardcapriolo/hive-rank.git
Cloning into 'hive-rank'...
remote: Counting objects: 74, done.
remote: Compressing objects: 100% (35/35), done.
remote: Total 74 (delta 12), reused 70 (delta 8)
Unpacking objects: 100% (74/74), done.
[edward@jackintosh java]$ cd hive-rank/
[edward@jackintosh hive-rank]$ mvn install -Dmaven.test.skip=true
/usr/java/jdk1.7.0_13
...
[INFO] Installing
/home/edward/Documents/java/hive-rank/target/hive-rank-1.0.0-SNAPSHOT.jar
to
/home/edward/.m2/repository/com/m6d/hive-rank/1.0.0-SNAPSHOT/hive-rank-1.0.0-SNAPSHOT.jar
[INFO] Installing /home/edward/Documents/java/hive-rank/pom.xml to
/home/edward/.m2/repository/com/m6d/hive-rank/1.0.0-SNAPSHOT/hive-rank-1.0.0-SNAPSHOT.pom
[INFO]
------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO]
------------------------------------------------------------------------
[INFO] Total time: 2.986s
[INFO] Finished at: Tue Apr 02 16:11:41 EDT 2013
[INFO] Final Memory: 17M/210M
[INFO]
------------------------------------------------------------------------

Then copy:
cp
/home/edward/Documents/java/hive-rank/target/hive-rank-1.0.0-SNAPSHOT.jar
to
/home/edward/.m2/repository/com/m6d/hive-rank/1.0.0-SNAPSHOT/hive-rank-1.0.0-SNAPSHOT.jar
to your hadoop lib.

add jar <name of jar file>
..... etc etc


On Tue, Apr 2, 2013 at 3:51 PM, Keith Wiley <kw...@keithwiley.com> wrote:

> Yep, the original article is definitely erroneous in this regard.  I
> figured out that eventually.  I'm not sure how much I can trust that
> resource now.  I may have to look elsewhere.  I agree that Edward's
> description is pretty good, but as I said earlier, I can't actually use his
> code, so I'm trying to cobble a workable solution together from the various
> resources available.  Ritesh's article, despite the error in the Hive
> syntax, is still useful in that it enables one to quickly compile a simple
> rank jar without relying on git, maven, or other project dependencies --
> problems which have plagued me with Edward's approach.  So, if I can use
> Ritesh's method to write a simple rank function, and Edward's accurate
> description of how to construct the query, then I can put all the pieces
> together into a workable solution.
>
> I'll let you know if I get it.
>
> On Apr 2, 2013, at 10:56 , Igor Tatarinov wrote:
>
> > You are getting the error because you are ORDERing BY rank but rank is
> not in the top SELECT
> >
> > Also, DISTRIBUTE BY/SORT BY are done after SELECT so you have to use a
> subquery:
> > SELECT ..., rank(user)
> > FROM (SELECT ... DISTRIBUTE BY ... SORT BY)
> >
> > igor
> > decide.com
> >
> >
> > On Tue, Apr 2, 2013 at 10:03 AM, Keith Wiley <kw...@keithwiley.com>
> wrote:
> > On Apr 1, 2013, at 16:12 , Alexander Pivovarov wrote:
> >
> > >
> http://ragrawal.wordpress.com/2011/11/18/extract-top-n-records-in-each-group-in-hadoophive/
> >
> > Is there any possibility there is a bug in Ritesh Agrawal's query
> statement from that article?  I created a test table with the exact column
> names from the example in the article and used a minimally altered version
> of the command (I removed the where clause to simplify things a bit) and
> got an error which suggests there is something slightly wrong with the
> command (or perhaps the table has to be configured a special way).  Here's
> what I get when I almost perfectly duplicate that example:
> >
> > hive> describe test;
> > OK
> > user    string
> > category        string
> > value   int
> > Time taken: 0.082 seconds
> > ==================================================
> > hive> select * from test;
> > OK
> > user1   cat1    1
> > user1   cat1    2
> > user1   cat1    3
> > user1   cat2    10
> > user1   cat2    20
> > user1   cat2    30
> > user2   cat1    11
> > user2   cat1    21
> > user2   cat1    31
> > user2   cat2    5
> > user2   cat2    6
> > user2   cat2    7
> > Time taken: 0.202 seconds
> > ==================================================
> > hive> SELECT user, category, value
> >     > FROM (
> >     > SELECT user, category, rank(user) as rank, value
> >     > FROM test
> >     > DISTRIBUTE BY user
> >     > SORT BY user, value desc
> >     > ) a
> >     > WHERE rank > 1
> >     > ORDER BY user, rank;
> > FAILED: Error in semantic analysis: Line 9:15 Invalid table alias or
> column reference 'rank': (possible column names are: _col0, _col1, _col2)
> > hive>
> >
> >
> ________________________________________________________________________________
> > Keith Wiley     kwiley@keithwiley.com     keithwiley.com
> music.keithwiley.com
> >
> > "What I primarily learned in grad school is how much I *don't* know.
> > Consequently, I left grad school with a higher ignorance to knowledge
> ratio than
> > when I entered."
> >                                            --  Keith Wiley
> >
> ________________________________________________________________________________
> >
> >
>
>
>
> ________________________________________________________________________________
> Keith Wiley     kwiley@keithwiley.com     keithwiley.com
> music.keithwiley.com
>
> "Luminous beings are we, not this crude matter."
>                                            --  Yoda
>
> ________________________________________________________________________________
>
>

Re: Need rank()

Posted by Keith Wiley <kw...@keithwiley.com>.
Yep, the original article is definitely erroneous in this regard.  I figured out that eventually.  I'm not sure how much I can trust that resource now.  I may have to look elsewhere.  I agree that Edward's description is pretty good, but as I said earlier, I can't actually use his code, so I'm trying to cobble a workable solution together from the various resources available.  Ritesh's article, despite the error in the Hive syntax, is still useful in that it enables one to quickly compile a simple rank jar without relying on git, maven, or other project dependencies -- problems which have plagued me with Edward's approach.  So, if I can use Ritesh's method to write a simple rank function, and Edward's accurate description of how to construct the query, then I can put all the pieces together into a workable solution.

I'll let you know if I get it.

On Apr 2, 2013, at 10:56 , Igor Tatarinov wrote:

> You are getting the error because you are ORDERing BY rank but rank is not in the top SELECT
> 
> Also, DISTRIBUTE BY/SORT BY are done after SELECT so you have to use a subquery:
> SELECT ..., rank(user)
> FROM (SELECT ... DISTRIBUTE BY ... SORT BY)
> 
> igor
> decide.com
> 
> 
> On Tue, Apr 2, 2013 at 10:03 AM, Keith Wiley <kw...@keithwiley.com> wrote:
> On Apr 1, 2013, at 16:12 , Alexander Pivovarov wrote:
> 
> > http://ragrawal.wordpress.com/2011/11/18/extract-top-n-records-in-each-group-in-hadoophive/
> 
> Is there any possibility there is a bug in Ritesh Agrawal's query statement from that article?  I created a test table with the exact column names from the example in the article and used a minimally altered version of the command (I removed the where clause to simplify things a bit) and got an error which suggests there is something slightly wrong with the command (or perhaps the table has to be configured a special way).  Here's what I get when I almost perfectly duplicate that example:
> 
> hive> describe test;
> OK
> user    string
> category        string
> value   int
> Time taken: 0.082 seconds
> ==================================================
> hive> select * from test;
> OK
> user1   cat1    1
> user1   cat1    2
> user1   cat1    3
> user1   cat2    10
> user1   cat2    20
> user1   cat2    30
> user2   cat1    11
> user2   cat1    21
> user2   cat1    31
> user2   cat2    5
> user2   cat2    6
> user2   cat2    7
> Time taken: 0.202 seconds
> ==================================================
> hive> SELECT user, category, value
>     > FROM (
>     > SELECT user, category, rank(user) as rank, value
>     > FROM test
>     > DISTRIBUTE BY user
>     > SORT BY user, value desc
>     > ) a
>     > WHERE rank > 1
>     > ORDER BY user, rank;
> FAILED: Error in semantic analysis: Line 9:15 Invalid table alias or column reference 'rank': (possible column names are: _col0, _col1, _col2)
> hive>
> 
> ________________________________________________________________________________
> Keith Wiley     kwiley@keithwiley.com     keithwiley.com    music.keithwiley.com
> 
> "What I primarily learned in grad school is how much I *don't* know.
> Consequently, I left grad school with a higher ignorance to knowledge ratio than
> when I entered."
>                                            --  Keith Wiley
> ________________________________________________________________________________
> 
> 


________________________________________________________________________________
Keith Wiley     kwiley@keithwiley.com     keithwiley.com    music.keithwiley.com

"Luminous beings are we, not this crude matter."
                                           --  Yoda
________________________________________________________________________________


Re: Need rank()

Posted by Igor Tatarinov <ig...@decide.com>.
You are getting the error because you are ORDERing BY rank but rank is not
in the top SELECT

Also, DISTRIBUTE BY/SORT BY are done after SELECT so you have to use a
subquery:
SELECT ..., rank(user)
FROM (SELECT ... DISTRIBUTE BY ... SORT BY)

igor
decide.com


On Tue, Apr 2, 2013 at 10:03 AM, Keith Wiley <kw...@keithwiley.com> wrote:

> On Apr 1, 2013, at 16:12 , Alexander Pivovarov wrote:
>
> >
> http://ragrawal.wordpress.com/2011/11/18/extract-top-n-records-in-each-group-in-hadoophive/
>
> Is there any possibility there is a bug in Ritesh Agrawal's query
> statement from that article?  I created a test table with the exact column
> names from the example in the article and used a minimally altered version
> of the command (I removed the where clause to simplify things a bit) and
> got an error which suggests there is something slightly wrong with the
> command (or perhaps the table has to be configured a special way).  Here's
> what I get when I almost perfectly duplicate that example:
>
> hive> describe test;
> OK
> user    string
> category        string
> value   int
> Time taken: 0.082 seconds
> ==================================================
> hive> select * from test;
> OK
> user1   cat1    1
> user1   cat1    2
> user1   cat1    3
> user1   cat2    10
> user1   cat2    20
> user1   cat2    30
> user2   cat1    11
> user2   cat1    21
> user2   cat1    31
> user2   cat2    5
> user2   cat2    6
> user2   cat2    7
> Time taken: 0.202 seconds
> ==================================================
> hive> SELECT user, category, value
>     > FROM (
>     > SELECT user, category, rank(user) as rank, value
>     > FROM test
>     > DISTRIBUTE BY user
>     > SORT BY user, value desc
>     > ) a
>     > WHERE rank > 1
>     > ORDER BY user, rank;
> FAILED: Error in semantic analysis: Line 9:15 Invalid table alias or
> column reference 'rank': (possible column names are: _col0, _col1, _col2)
> hive>
>
>
> ________________________________________________________________________________
> Keith Wiley     kwiley@keithwiley.com     keithwiley.com
> music.keithwiley.com
>
> "What I primarily learned in grad school is how much I *don't* know.
> Consequently, I left grad school with a higher ignorance to knowledge
> ratio than
> when I entered."
>                                            --  Keith Wiley
>
> ________________________________________________________________________________
>
>

Re: Need rank()

Posted by Keith Wiley <kw...@keithwiley.com>.
On Apr 1, 2013, at 16:12 , Alexander Pivovarov wrote:

> http://ragrawal.wordpress.com/2011/11/18/extract-top-n-records-in-each-group-in-hadoophive/

Is there any possibility there is a bug in Ritesh Agrawal's query statement from that article?  I created a test table with the exact column names from the example in the article and used a minimally altered version of the command (I removed the where clause to simplify things a bit) and got an error which suggests there is something slightly wrong with the command (or perhaps the table has to be configured a special way).  Here's what I get when I almost perfectly duplicate that example:

hive> describe test;
OK
user	string	
category	string	
value	int	
Time taken: 0.082 seconds
==================================================
hive> select * from test;
OK
user1	cat1	1
user1	cat1	2
user1	cat1	3
user1	cat2	10
user1	cat2	20
user1	cat2	30
user2	cat1	11
user2	cat1	21
user2	cat1	31
user2	cat2	5
user2	cat2	6
user2	cat2	7
Time taken: 0.202 seconds
==================================================
hive> SELECT user, category, value
    > FROM (
    > SELECT user, category, rank(user) as rank, value
    > FROM test
    > DISTRIBUTE BY user
    > SORT BY user, value desc
    > ) a
    > WHERE rank > 1
    > ORDER BY user, rank;
FAILED: Error in semantic analysis: Line 9:15 Invalid table alias or column reference 'rank': (possible column names are: _col0, _col1, _col2)
hive> 

________________________________________________________________________________
Keith Wiley     kwiley@keithwiley.com     keithwiley.com    music.keithwiley.com

"What I primarily learned in grad school is how much I *don't* know.
Consequently, I left grad school with a higher ignorance to knowledge ratio than
when I entered."
                                           --  Keith Wiley
________________________________________________________________________________


Re: Need rank(), can't build m6d's version

Posted by Keith Wiley <kw...@keithwiley.com>.
Thanks!  That looks like it has serious potential.  I'll study up on it.

On Apr 1, 2013, at 16:12 , Alexander Pivovarov wrote:

> http://ragrawal.wordpress.com/2011/11/18/extract-top-n-records-in-each-group-in-hadoophive/


________________________________________________________________________________
Keith Wiley     kwiley@keithwiley.com     keithwiley.com    music.keithwiley.com

"And what if we picked the wrong religion?  Every week, we're just making God
madder and madder!"
                                           --  Homer Simpson
________________________________________________________________________________


Re: Need rank(), can't build m6d's version

Posted by Alexander Pivovarov <ap...@gmail.com>.
http://ragrawal.wordpress.com/2011/11/18/extract-top-n-records-in-each-group-in-hadoophive/


On Mon, Apr 1, 2013 at 3:45 PM, Keith Wiley <kw...@keithwiley.com> wrote:

> I need rank() in Hive.  I have't had much luck with Edward Capriolo's on
> git and it comes with no documentation.  It depends on hive-test (also by
> Edward) and I can't get maven to build the tests for hive-test which
> prevents me from building the hive-test package (the jar)...which is all
> pretty tangential to the real goal, which is rank of course.
>
> Is there any place to download the hive-rank jar pre-built instead of
> having to build it out of git?  If not, has anyone built it recently, or
> namely has anyone built the hive-test project it depends on?  The following
> is just one example of the sort of error I see.  Remember this error is on
> hive-test, which frankly I don't even care about.  :-)
>
> Feeling a little desperate...thanks for any help.
>
>
> -------------------------------------------------------------------------------
> Test set: com.jointhegrid.hive_test.EmbeddedHiveExampleTest
>
> -------------------------------------------------------------------------------
> Tests run: 1, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 8.878 sec
> <<< FAILURE!
> testA(com.jointhegrid.hive_test.EmbeddedHiveExampleTest)  Time elapsed:
> 7.823 sec  <<< FAILURE!
> java.lang.AssertionError: expected:<0> but was:<9>
>         at org.junit.Assert.fail(Assert.java:91)
>         at org.junit.Assert.failNotEquals(Assert.java:645)
>         at org.junit.Assert.assertEquals(Assert.java:126)
>         at org.junit.Assert.assertEquals(Assert.java:470)
>         at org.junit.Assert.assertEquals(Assert.java:454)
>         at
> com.jointhegrid.hive_test.EmbeddedHiveExampleTest.testA(EmbeddedHiveExampleTest.java:51)
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>         at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>         at java.lang.reflect.Method.invoke(Method.java:597)
>         at junit.framework.TestCase.runTest(TestCase.java:168)
>         at junit.framework.TestCase.runBare(TestCase.java:134)
>         at junit.framework.TestResult$1.protect(TestResult.java:110)
>         at junit.framework.TestResult.runProtected(TestResult.java:128)
>         at junit.framework.TestResult.run(TestResult.java:113)
>         at junit.framework.TestCase.run(TestCase.java:124)
>         at junit.framework.TestSuite.runTest(TestSuite.java:243)
>         at junit.framework.TestSuite.run(TestSuite.java:238)
>         at
> org.junit.internal.runners.JUnit38ClassRunner.run(JUnit38ClassRunner.java:83)
>         at
> org.apache.maven.surefire.junit4.JUnit4TestSet.execute(JUnit4TestSet.java:35)
>         at
> org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:115)
>         at
> org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:97)
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>         at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>         at java.lang.reflect.Method.invoke(Method.java:597)
>         at
> org.apache.maven.surefire.booter.ProviderFactory$ClassLoaderProxy.invoke(ProviderFactory.java:103)
>         at $Proxy0.invoke(Unknown Source)
>         at
> org.apache.maven.surefire.booter.SurefireStarter.invokeProvider(SurefireStarter.java:150)
>         at
> org.apache.maven.surefire.booter.SurefireStarter.runSuitesInProcess(SurefireStarter.java:91)
>         at
> org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:69)
>
>
> ________________________________________________________________________________
> Keith Wiley     kwiley@keithwiley.com     keithwiley.com
> music.keithwiley.com
>
> "I do not feel obliged to believe that the same God who has endowed us with
> sense, reason, and intellect has intended us to forgo their use."
>                                            --  Galileo Galilei
>
> ________________________________________________________________________________
>
>

Re: Need rank(), can't build m6d's version

Posted by Keith Wiley <kw...@keithwiley.com>.
Thank you Edward.  I appreciate you trying to help out.  I'll see if I can get that to work.

On Apr 1, 2013, at 17:09 , Edward Capriolo wrote:

> approx steps
> git clone hive-test 
> mvn install
> git clone hive-rang
> mvn install
> 
> or just
> git clone hive-test
> mvn -Dmaven.test.skip=true install

________________________________________________________________________________
Keith Wiley     kwiley@keithwiley.com     keithwiley.com    music.keithwiley.com

"The easy confidence with which I know another man's religion is folly teaches
me to suspect that my own is also."
                                           --  Mark Twain
________________________________________________________________________________


Re: Need rank(), can't build m6d's version

Posted by Edward Capriolo <ed...@gmail.com>.
approx steps
git clone hive-test
mvn install
git clone hive-rang
mvn install

or just
git clone hive-test
mvn -Dmaven.test.skip=true install



On Mon, Apr 1, 2013 at 6:45 PM, Keith Wiley <kw...@keithwiley.com> wrote:

> I need rank() in Hive.  I have't had much luck with Edward Capriolo's on
> git and it comes with no documentation.  It depends on hive-test (also by
> Edward) and I can't get maven to build the tests for hive-test which
> prevents me from building the hive-test package (the jar)...which is all
> pretty tangential to the real goal, which is rank of course.
>
> Is there any place to download the hive-rank jar pre-built instead of
> having to build it out of git?  If not, has anyone built it recently, or
> namely has anyone built the hive-test project it depends on?  The following
> is just one example of the sort of error I see.  Remember this error is on
> hive-test, which frankly I don't even care about.  :-)
>
> Feeling a little desperate...thanks for any help.
>
>
> -------------------------------------------------------------------------------
> Test set: com.jointhegrid.hive_test.EmbeddedHiveExampleTest
>
> -------------------------------------------------------------------------------
> Tests run: 1, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 8.878 sec
> <<< FAILURE!
> testA(com.jointhegrid.hive_test.EmbeddedHiveExampleTest)  Time elapsed:
> 7.823 sec  <<< FAILURE!
> java.lang.AssertionError: expected:<0> but was:<9>
>         at org.junit.Assert.fail(Assert.java:91)
>         at org.junit.Assert.failNotEquals(Assert.java:645)
>         at org.junit.Assert.assertEquals(Assert.java:126)
>         at org.junit.Assert.assertEquals(Assert.java:470)
>         at org.junit.Assert.assertEquals(Assert.java:454)
>         at
> com.jointhegrid.hive_test.EmbeddedHiveExampleTest.testA(EmbeddedHiveExampleTest.java:51)
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>         at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>         at java.lang.reflect.Method.invoke(Method.java:597)
>         at junit.framework.TestCase.runTest(TestCase.java:168)
>         at junit.framework.TestCase.runBare(TestCase.java:134)
>         at junit.framework.TestResult$1.protect(TestResult.java:110)
>         at junit.framework.TestResult.runProtected(TestResult.java:128)
>         at junit.framework.TestResult.run(TestResult.java:113)
>         at junit.framework.TestCase.run(TestCase.java:124)
>         at junit.framework.TestSuite.runTest(TestSuite.java:243)
>         at junit.framework.TestSuite.run(TestSuite.java:238)
>         at
> org.junit.internal.runners.JUnit38ClassRunner.run(JUnit38ClassRunner.java:83)
>         at
> org.apache.maven.surefire.junit4.JUnit4TestSet.execute(JUnit4TestSet.java:35)
>         at
> org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:115)
>         at
> org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:97)
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>         at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>         at java.lang.reflect.Method.invoke(Method.java:597)
>         at
> org.apache.maven.surefire.booter.ProviderFactory$ClassLoaderProxy.invoke(ProviderFactory.java:103)
>         at $Proxy0.invoke(Unknown Source)
>         at
> org.apache.maven.surefire.booter.SurefireStarter.invokeProvider(SurefireStarter.java:150)
>         at
> org.apache.maven.surefire.booter.SurefireStarter.runSuitesInProcess(SurefireStarter.java:91)
>         at
> org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:69)
>
>
> ________________________________________________________________________________
> Keith Wiley     kwiley@keithwiley.com     keithwiley.com
> music.keithwiley.com
>
> "I do not feel obliged to believe that the same God who has endowed us with
> sense, reason, and intellect has intended us to forgo their use."
>                                            --  Galileo Galilei
>
> ________________________________________________________________________________
>
>