Posted to user@hive.apache.org by Peter Marron <Pe...@trilliumsoftware.com> on 2012/10/31 18:42:26 UTC

Creating Indexes

Hi,

I am still having problems building my index.
In an attempt to find someone who can help me
I'll go through all the steps that I try.


1)      First I load my data into Hive.

hive> LOAD DATA INPATH 'E3/score.csv' OVERWRITE INTO TABLE score;
Loading data to table default.score
Deleted hdfs://localhost/data/warehouse/score
OK
Time taken: 7.817 seconds


2)      Then I try to create the index

hive> CREATE INDEX bigIndex
    > ON TABLE score(Ath_Seq_Num)
    > AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler';
FAILED: Error in metadata: java.lang.RuntimeException: Please specify deferred rebuild using " WITH DEFERRED REBUILD ".
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask
hive>


3)      OK, so it suggests that I use "WITH DEFERRED REBUILD" and so I do
hive>
    >
    > CREATE INDEX bigIndex
    > ON TABLE score(Ath_Seq_Num)
    > AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'
    > WITH DEFERRED REBUILD;
OK
Time taken: 0.603 seconds
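
(At this point only the index metadata exists; no index data has been built
yet. A quick way to confirm the definition, assuming the SHOW INDEX syntax
available from Hive 0.7 onwards:

hive> SHOW INDEX ON score;

This should list bigIndex, the indexed column, the index table
default__score_bigindex__ and the index type.)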


4)      Now, to build the index I assume that I use ALTER INDEX as follows:

hive> ALTER INDEX bigIndex ON score REBUILD;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 138
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapred.reduce.tasks=<number>
Starting Job = job_201210311448_0001, Tracking URL = http://localhost:50030/jobdetails.jsp?jobid=job_201210311448_0001
Kill Command = /data/hadoop-1.0.3/libexec/../bin/hadoop job  -Dmapred.job.tracker=localhost:8021 -kill job_201210311448_0001
Hadoop job information for Stage-1: number of mappers: 511; number of reducers: 138
2012-10-31 15:59:27,076 Stage-1 map = 0%,  reduce = 0%


5)      This all looks promising, and after increasing my heap size to get the Map/Reduce to complete, I get this an hour later

2012-10-31 17:08:23,572 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 4135.47 sec
MapReduce Total cumulative CPU time: 0 days 1 hours 8 minutes 55 seconds 470 msec
Ended Job = job_201210311448_0001
Loading data to table default.default__score_bigindex__
Deleted hdfs://localhost/data/warehouse/default__score_bigindex__
Invalid alter operation: Unable to alter index.
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask

So what have I done wrong, and what do I need to do to get this index to build successfully?

Any help appreciated.

Peter Marron

From: Peter Marron [mailto:Peter.Marron@trilliumsoftware.com]
Sent: 24 October 2012 13:27
To: user@hive.apache.org
Subject: RE: Indexes

Hi Shreepadma,

Thanks for this. Looks exactly like the information I need.
I was going to reply when I had tried it all out, but I'm having
problems creating the index at the moment (I'm getting an
OutOfMemoryError). So I thought that I had better reply now
to say thank you.

Peter Marron


From: Shreepadma Venugopalan [mailto:shreepadma@cloudera.com]
Sent: 23 October 2012 19:49
To: user@hive.apache.org
Subject: Re: Indexes

Hi Peter,

Indexing support was added to Hive in 0.7, and in 0.8 the query compiler was
enhanced to optimize some classes of queries (certain group-bys and joins)
using indexes. Assuming you are using the built-in index handler, you need
to do the following _after_ you have created and rebuilt the index:

SET hive.index.compact.file='/tmp/index_result';
SET hive.input.format=org.apache.hadoop.hive.ql.index.compact.HiveCompactIndexInputFormat;

You will then notice a speed-up for a query of the form:

select count(*) from tab where indexed_col = some_val

Thanks,
Shreepadma
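
Putting the steps together, a minimal end-to-end sketch (table, column and
index names are illustrative; the INSERT OVERWRITE DIRECTORY step, which
materializes the matching index entries into /tmp/index_result, is an
assumption based on the compact-index design docs rather than something
stated above):

CREATE INDEX tab_idx ON TABLE tab(indexed_col)
AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'
WITH DEFERRED REBUILD;

ALTER INDEX tab_idx ON tab REBUILD;

-- assumed step: dump the index entries for the value of interest into the
-- file that hive.index.compact.file will point at
INSERT OVERWRITE DIRECTORY '/tmp/index_result'
SELECT `_bucketname`, `_offsets`
FROM default__tab_tab_idx__ WHERE indexed_col = some_val;

SET hive.index.compact.file='/tmp/index_result';
SET hive.input.format=org.apache.hadoop.hive.ql.index.compact.HiveCompactIndexInputFormat;

SELECT count(*) FROM tab WHERE indexed_col = some_val;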

On Tue, Oct 23, 2012 at 5:44 AM, Peter Marron <Pe...@trilliumsoftware.com> wrote:
Hi,

I'm very much a Hive newbie but I've been looking at HIVE-417 and this page in particular:
http://cwiki.apache.org/confluence/display/Hive/IndexDev
Using this information I've been able to create an index (using Hive 0.8.1)
and when I look at the contents it all looks very promising indeed.
However on the same page there's this comment:

"...This document currently only covers index creation and maintenance. A follow-on will explain how indexes are used to optimize queries (building on FilterPushdownDev<https://cwiki.apache.org/confluence/display/Hive/FilterPushdownDev>)...."

However I can't find the "follow-on" which tells me how to exploit the index that I've
created to "optimize" subsequent queries.
Now I've been told that I can create and use indexes with the current
release of Hive _without_ writing and developing any Java code of my own.
Is this true? If so, how?

Any help appreciated.

Peter Marron.


Re: Creating Indexes

Posted by Dean Wampler <de...@thinkbiganalytics.com>.
That's what I would expect, too, but it appears a reducer task wanted to
update some SQL statistics and to do that it wanted to load the derby jar.

On Thu, Nov 1, 2012 at 8:09 AM, Bejoy KS <be...@yahoo.com> wrote:

> AFAIK you don't need any hive jars on the cluster. The hive jars are just
> required on the client node.
>
> Regards
> Bejoy KS
>
> Sent from handheld, please excuse typos.
> ------------------------------
> From: Dean Wampler <de...@thinkbiganalytics.com>
> Date: Thu, 1 Nov 2012 08:01:51 -0500
> To: <us...@hive.apache.org>
> Reply-To: user@hive.apache.org
> Subject: Re: Creating Indexes
>
> It looks like you're using Derby with a real cluster, not just a single
> machine in local or pseudo-distributed mode. I haven't tried this myself,
> but the derby jar is probably not on the machine that ran the reducer task
> that failed.
>
> dean
>
> On Thu, Nov 1, 2012 at 4:31 AM, Peter Marron <
> Peter.Marron@trilliumsoftware.com> wrote:
>
>> Hi Shreepadma,
>>
>> I agree that the error looks odd. However I can't believe that I would
>> have got this far with Hive if there was no derby jar. Nevertheless I
>> checked. Here is a directory listing of the Hive install:
>>
>> pmarron@pmarron-ubuntu:/data/hive/lib$ ls
>> ant-contrib-1.0b3.jar            antlr-2.7.7.jar                antlr-3.0.1.jar
>> antlr-runtime-3.0.1.jar          asm-3.1.jar                    commons-cli-1.2.jar
>> commons-codec-1.3.jar            commons-collections-3.2.1.jar  commons-dbcp-1.4.jar
>> commons-lang-2.4.jar             commons-logging-1.0.4.jar      commons-logging-api-1.0.4.jar
>> commons-pool-1.5.4.jar           datanucleus-connectionpool-2.0.3.jar
>> datanucleus-core-2.0.3.jar       datanucleus-enhancer-2.0.3.jar datanucleus-rdbms-2.0.3.jar
>> derby-10.4.2.0.jar               guava-r06.jar                  hbase-0.89.0-SNAPSHOT.jar
>> hbase-0.89.0-SNAPSHOT-tests.jar  hive-anttasks-0.8.1.jar        hive-builtins-0.8.1.jar
>> hive-cli-0.8.1.jar               hive-common-0.8.1.jar          hive-contrib-0.8.1.jar
>> hive_contrib.jar                 hive-exec-0.8.1.jar            hive-hbase-handler-0.8.1.jar
>> hive-hwi-0.8.1.jar               hive-hwi-0.8.1.war             hive-jdbc-0.8.1.jar
>> hive-metastore-0.8.1.jar         hive-pdk-0.8.1.jar             hive-serde-0.8.1.jar
>> hive-service-0.8.1.jar           hive-shims-0.8.1.jar           javaewah-0.3.jar
>> jdo2-api-2.3-ec.jar              jline-0.9.94.jar               json-20090211.jar
>> junit-4.10.jar                   libfb303-0.7.0.jar             libfb303.jar
>> libthrift-0.7.0.jar              libthrift.jar                  log4j-1.2.15.jar
>> log4j-1.2.16.jar                 mockito-all-1.8.2.jar          php
>> py                               slf4j-api-1.6.1.jar            slf4j-log4j12-1.6.1.jar
>> stringtemplate-3.1-b1.jar        velocity-1.5.jar               zookeeper-3.3.1.jar
>>
>> Also I found a derby.log in my home directory which I have attached.
>>
>> Regards,
>>
>> Z
>>
>>
>> From: Shreepadma Venugopalan [mailto:shreepadma@cloudera.com]
>> Sent: 31 October 2012 21:58
>> To: user@hive.apache.org
>> Subject: Re: Creating Indexes
>>
>> Hi Peter,
>>
>> From the execution log,
>>
>> java.lang.ClassNotFoundException: org.apache.derby.jdbc.EmbeddedDriver
>>           at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
>>           at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
>>           at java.security.AccessController.doPrivileged(Native Method)
>>           at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
>>           at java.lang.ClassLoader.loadClass(ClassLoader.java:423)
>>           at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
>>           at java.lang.ClassLoader.loadClass(ClassLoader.java:356)
>>           at java.lang.Class.forName0(Native Method)
>>           at java.lang.Class.forName(Class.java:186)
>>           at org.apache.hadoop.hive.ql.stats.jdbc.JDBCStatsPublisher.connect(JDBCStatsPublisher.java:68)
>>           at org.apache.hadoop.hive.ql.exec.FileSinkOperator.publishStats(FileSinkOperator.java:778)
>>           at org.apache.hadoop.hive.ql.exec.FileSinkOperator.closeOp(FileSinkOperator.java:723)
>>           at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:557)
>>           at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:566)
>>           at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:566)
>>           at org.apache.hadoop.hive.ql.exec.ExecReducer.close(ExecReducer.java:303)
>>           at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:529)
>>           at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:420)
>>           at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
>>           at java.security.AccessController.doPrivileged(Native Method)
>>           at javax.security.auth.Subject.doAs(Subject.java:415)
>>           at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
>>           at org.apache.hadoop.mapred.Child.main(Child.java:249)
>>
>> It appears that the error is due to derby classes not being found. Can you
>> check if the derby jars are present?
>>
>> Thanks,
>> Shreepadma
>>
>>
>> On Wed, Oct 31, 2012 at 12:52 PM, Peter Marron <
>> Peter.Marron@trilliumsoftware.com> wrote:
>>
>> Hi Shreepadma,
>>
>> Happy to attach the logs, not quite sure which one is going to
>> be most useful. Please find attached one which contained an
>> error of some sort. Not sure if it's related or not to the index error.
>> Found the file in this location:
>>
>> /data/hadoop/logs/userlogs/job_201210311448_0001/attempt_201210311448_0001_r_000137_0/syslog
>>
>> so maybe that will help you locate any other file that you might want to
>> see.
>>
>> Thanks for your efforts.
>>
>> Peter Marron
>>
>> From: Shreepadma Venugopalan [mailto:shreepadma@cloudera.com]
>> Sent: 31 October 2012 18:38
>> To: user@hive.apache.org
>> Subject: Re: Creating Indexes
>>
>> Hi Peter,
>>
>> Can you attach the execution logs? What is the exception that you see in
>> the execution logs?
>>
>> Thanks,
>> Shreepadma
>>
>>
>> On Wed, Oct 31, 2012 at 10:42 AM, Peter Marron <
>> Peter.Marron@trilliumsoftware.com> wrote:
>>
>>  [snip - Peter's original "Creating Indexes" message, quoted in full at
>> the top of this thread]


-- 
Dean Wampler, Ph.D.
thinkbiganalytics.com
+1-312-339-1330

Re: Creating Indexes

Posted by Bejoy KS <be...@yahoo.com>.
AFAIK you don't need any hive jars on the cluster. The hive jars are just required on the client node.

Regards
Bejoy KS

Sent from handheld, please excuse typos.

-----Original Message-----
From: Dean Wampler <de...@thinkbiganalytics.com>
Date: Thu, 1 Nov 2012 08:01:51 
To: <us...@hive.apache.org>
Reply-To: user@hive.apache.org
Subject: Re: Creating Indexes

It looks like you're using Derby with a real cluster, not just a single
machine in local or pseudo-distributed mode. I haven't tried this myself,
but the derby jar is probably not on the machine that ran the reducer task
that failed.

dean

On Thu, Nov 1, 2012 at 4:31 AM, Peter Marron <
Peter.Marron@trilliumsoftware.com> wrote:

>  [snip - earlier messages in this thread, quoted in full above]


RE: Creating Indexes

Posted by Peter Marron <Pe...@trilliumsoftware.com>.
Hi Shreepadma,

I have looked and I can't find anything that looks like a log with any more
information in it. The vast bulk of logs that I find seem to concern the
Map/Reduce which we agree has succeeded.
The only thing that looks a little odd is in the syslogs for the reducers.
All 138 of them seem to feature a line like this:

2012-11-07 11:10:17,960 ERROR org.apache.hadoop.hive.ql.exec.FileSinkOperator: StatsPublishing error: cannot connect to database

I don't know whether this is a serious problem, or whether it might have any
bearing on the subsequent Hive error.
I have attached the last reducer syslog so that you can see this problem in context.
Also I managed to find a hive.log, attached.
It seems to have lots of warnings and "errors" but I don't know if any of them
are relevant.
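
(A possible workaround for that StatsPublishing error, an assumption rather
than something tried in this thread: automatic stats gathering can be
switched off, so the rebuild never calls the JDBC stats publisher at all:

hive> SET hive.stats.autogather=false;
hive> ALTER INDEX bigIndex ON score REBUILD;

Only table statistics should be lost; the index data itself is written by
the Map/Reduce job, not by the stats publisher.)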

I'm being pressed to make some progress on this problem, but
I don't know what to do next. Are there any other files that I can
provide that might help? Do I need to add some instrumentation?
I guess that I could re-build various Hive class files to get more information.
But I really don't know Hive at all and so I suspect that it'll take me
ages flailing around before I find anything helpful.
Any suggestions for a way forward gratefully received,
I'm happy to make any changes that might help.

Regards,

Z

From: Shreepadma Venugopalan [mailto:shreepadma@cloudera.com]
Sent: 03 November 2012 00:06
To: user@hive.apache.org
Subject: Re: Creating Indexes

Hi Peter,

While it looks like the map-red task may have succeeded, it looks like the alter index actually failed. You should look into the execution log to see what the exception is. Without knowing why the DDLTask failed it's hard to pinpoint the problem.

As for the original problem with the jar: as Dean pointed out, for some odd reason the jar was not on the classpath prior to the ADD JAR.

Thanks,
Shreepadma
On Fri, Nov 2, 2012 at 4:59 PM, Peter Marron <Pe...@trilliumsoftware.com> wrote:
Hi Dean,

At this stage I'm really not worried about this being a hack.
I just want to get it to work, and I'm grateful for all your help.
I did as you suggested and now, as far as I can see, the Map/Reduce
has succeeded. When I look in the log for the last reduce I no longer
find an error. However this is the output from the hive command
session:

MapReduce Total cumulative CPU time: 0 days 1 hours 14 minutes 51 seconds 360 msec
Ended Job = job_201211021743_0001
Loading data to table default.default__score_bigindex__
Deleted hdfs://localhost/data/warehouse/default__score_bigindex__
Invalid alter operation: Unable to alter index.
Table default.default__score_bigindex__ stats: [num_partitions: 0, num_files: 138, num_rows: 0, total_size: 446609024, raw_data_size: 0]
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask
MapReduce Jobs Launched:
Job 0: Map: 511  Reduce: 138   Accumulative CPU: 4491.36 sec   HDFS Read: 137123460712 HDFS Write: 446609024 SUCESS
Total MapReduce CPU Time Spent: 0 days 1 hours 14 minutes 51 seconds 360 msec
hive>

I find this very confusing. We have the bit where it says "Job 0:.... SUCCESS"
and this seems to fit with the fact that I can't find errors in the Map/Reduce.
On the other hand we have the bit where it says: "Invalid alter operation: Unable to alter index."
So has it successfully created the index or not? And if not, then what do I do next?
Is there somewhere else where it records Hive errors as opposed to Map/Reduce errors?

Regards,

Peter Marron


From: Dean Wampler [mailto:dean.wampler@thinkbiganalytics.com]
Sent: 02 November 2012 14:03

To: user@hive.apache.org
Subject: Re: Creating Indexes

Oh, I saw this line in your Hive output and just assumed you were running in a cluster:

Hadoop job information for Stage-1: number of mappers: 511; number of reducers: 138

I haven't tried running a job that big in pseudodistributed mode either, but that's beside the point.

So it seems to be an issue with indexing, but it still begs the question why derby isn't on the classpath for the task. I would try using the ADD JAR command, which copies the jar around the "cluster" and puts it on the classpath. It's what you would use with UDFs, for example:

ADD JAR /path/to/derby.jar;
ALTER INDEX ...;

It's a huge hack, but it just might work.
dean

On Fri, Nov 2, 2012 at 3:44 AM, Peter Marron <Pe...@trilliumsoftware.com> wrote:
Hi Dean,

I'm running everything on a single physical machine in pseudo-distributed mode.

Well it certainly looks like the reducer is looking for a derby.jar, although I must
confess I don't really understand why it would be doing that.
In an effort to fix that I copied the derby.jar (derby-10.4.2.0.jar) into the
Hadoop directory, where I assume that the reducer would be able to find it.
However I get exactly the same problem as before.
Is there some particular place that I should put the derby.jar to make this
problem go away? Is there anything else that I can try?

Peter Marron

From: Dean Wampler [mailto:dean.wampler@thinkbiganalytics.com]
Sent: 01 November 2012 13:02

To: user@hive.apache.org
Subject: Re: Creating Indexes

It looks like you're using Derby with a real cluster, not just a single machine in local or pseudo-distributed mode. I haven't tried this myself, but the derby jar is probably not on the machine that ran the reducer task that failed.

dean
On Thu, Nov 1, 2012 at 4:31 AM, Peter Marron <Pe...@trilliumsoftware.com> wrote:
Hi Shreepadma,

I agree that the error looks odd. However I can't believe that I would have
got this far with Hive if there was no derby jar. Nevertheless I checked.
Here is a directory listing of the Hive install:
 [snip]




--
Dean Wampler, Ph.D.
thinkbiganalytics.com
+1-312-339-1330




Re: Creating Indexes

Posted by Shreepadma Venugopalan <sh...@cloudera.com>.
Hi Peter,

While it looks like the map-red task may have succeeded, it looks like the
alter index actually failed. You should look into the execution log to see
what the exception is. Without knowing why the DDLTask failed it's hard to
pinpoint the problem.

As for the original problem with the jar: as Dean pointed out, for some odd
reason the jar was not on the classpath prior to the ADD JAR.

Thanks,
Shreepadma

On Fri, Nov 2, 2012 at 4:59 PM, Peter Marron <
Peter.Marron@trilliumsoftware.com> wrote:

>  [snip - Peter's "At this stage..." message, quoted in full elsewhere in
> this thread]

Re: Creating Indexes

Posted by Dean Wampler <de...@thinkbiganalytics.com>.
Hit send too soon...

I'm glad the ADD JAR hack appeared to work. You might verify if the
temporary files mentioned are still there and also verify that you have
write permissions for the target index directories. Other than that, I'm
not sure what to suggest. I haven't really used indexing much, but it
appears to be an immature feature, unfortunately.

dean
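
Both checks can be made from inside the Hive CLI with the dfs command. A
sketch, using the index directory from the job output; the scratch location
/tmp/hive-pmarron is only a guess at the Hive default for the user in this
thread:

hive> dfs -ls /tmp/hive-pmarron;
hive> dfs -ls /data/warehouse/default__score_bigindex__;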

On Fri, Nov 2, 2012 at 7:02 PM, Dean Wampler <
dean.wampler@thinkbiganalytics.com> wrote:

> Wow. Lots of quirks. I'm glad the ADD JAR
>
> On Fri, Nov 2, 2012 at 6:59 PM, Peter Marron <
> Peter.Marron@trilliumsoftware.com> wrote:
>
>>  [snip - Peter's "At this stage..." message, quoted in full elsewhere in
>> this thread]


-- 
Dean Wampler, Ph.D.
thinkbiganalytics.com
+1-312-339-1330

Re: Creating Indexes

Posted by Dean Wampler <de...@thinkbiganalytics.com>.
Wow. Lots of quirks. I'm glad the ADD JAR

On Fri, Nov 2, 2012 at 6:59 PM, Peter Marron <
Peter.Marron@trilliumsoftware.com> wrote:

>  [snip - Peter's "At this stage..." message, quoted in full below]



-- 
Dean Wampler, Ph.D.
thinkbiganalytics.com
+1-312-339-1330

RE: Creating Indexes

Posted by Peter Marron <Pe...@trilliumsoftware.com>.
Hi Dean,

At this stage I'm really not worried about this being a hack.
I just want to get it to work, and I'm grateful for all your help.
I did as you suggested and now, as far as I can see, the Map/Reduce
has succeeded. When I look in the log for the last reduce I no longer
find an error. However this is the output from the hive command
session:

MapReduce Total cumulative CPU time: 0 days 1 hours 14 minutes 51 seconds 360 msec
Ended Job = job_201211021743_0001
Loading data to table default.default__score_bigindex__
Deleted hdfs://localhost/data/warehouse/default__score_bigindex__
Invalid alter operation: Unable to alter index.
Table default.default__score_bigindex__ stats: [num_partitions: 0, num_files: 138, num_rows: 0, total_size: 446609024, raw_data_size: 0]
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask
MapReduce Jobs Launched:
Job 0: Map: 511  Reduce: 138   Accumulative CPU: 4491.36 sec   HDFS Read: 137123460712 HDFS Write: 446609024 SUCESS
Total MapReduce CPU Time Spent: 0 days 1 hours 14 minutes 51 seconds 360 msec
hive>

I find this very confusing. We have the bit where it says "Job 0:.... SUCCESS"
and this seems to fit with the fact that I can't find errors in the Map/Reduce.
On the other hand we have the bit where it says: "Invalid alter operation: Unable to alter index."
So has it successfully created the index or not? And if not, then what do I do next?
Is there somewhere else where it records Hive errors as opposed to Map/Reduce errors?

Regards,

Peter Marron


From: Dean Wampler [mailto:dean.wampler@thinkbiganalytics.com]
Sent: 02 November 2012 14:03
To: user@hive.apache.org
Subject: Re: Creating Indexes

Oh, I saw this line in your Hive output and just assumed you were running in a cluster:

Hadoop job information for Stage-1: number of mappers: 511; number of reducers: 138

I haven't tried running a job that big in pseudodistributed mode either, but that's beside the point.

So it seems to be an issue with indexing, but it still begs the question why derby isn't on the classpath for the task. I would try using the ADD JAR command, which copies the jar around the "cluster" and puts it on the classpath. It's what you would use with UDFs, for example:

ADD JAR /path/to/derby.jar;
ALTER INDEX ...;

It's a huge hack, but it just might work.
dean

On Fri, Nov 2, 2012 at 3:44 AM, Peter Marron <Pe...@trilliumsoftware.com> wrote:
Hi Dean,

I'm running everything on a single physical machine in pseudo-distributed mode.

Well it certainly looks like the reducer is looking for a derby.jar, although I must
confess I don't really understand why it would be doing that.
In an effort to fix that I copied the derby.jar (derby-10.4.2.0.jar) into the
Hadoop directory, where I assume that the reducer would be able to find it.
However I get exactly the same problem as before.
Is there some particular place that I should put the derby.jar to make this
problem go away? Is there anything else that I can try?

Peter Marron

From: Dean Wampler [mailto:dean.wampler@thinkbiganalytics.com]
Sent: 01 November 2012 13:02

To: user@hive.apache.org
Subject: Re: Creating Indexes

It looks like you're using Derby with a real cluster, not just a single machine in local or pseudo-distributed mode. I haven't tried this myself, but the derby jar is probably not on the machine that ran the reducer task that failed.

dean
On Thu, Nov 1, 2012 at 4:31 AM, Peter Marron <Pe...@trilliumsoftware.com> wrote:
Hi Shreepadma,

I agree that the error looks odd. However I can't believe that I would have
got this far with Hive if there was no derby jar. Nevertheless I checked.
Here is a directory listing of the Hive install:
 [snip]




--
Dean Wampler, Ph.D.
thinkbiganalytics.com
+1-312-339-1330



Re: Creating Indexes

Posted by Dean Wampler <de...@thinkbiganalytics.com>.
Oh, I saw this line in your Hive output and just assumed you were running
in a cluster:

Hadoop job information for Stage-1: number of mappers: 511; number of
reducers: 138

I haven't tried running a job that big in pseudodistributed mode either,
but that's beside the point.

So it seems to be an issue with indexing, but it still begs the question
why derby isn't on the classpath for the task. I would try using the ADD
JAR command, which copies the jar around the "cluster" and puts it on the
classpath. It's what you would use with UDFs, for example:

ADD JAR /path/to/derby.jar
ALTER INDEX ...;

It's a huge hack, but it just might work.
dean

On Fri, Nov 2, 2012 at 3:44 AM, Peter Marron <Peter.Marron@trilliumsoftware.com> wrote:

> Hi Dean,
>
> I'm running everything on a single physical machine in pseudo-distributed
> mode.
>
> Well it certainly looks like the reducer is looking for a derby.jar,
> although I must confess I don't really understand why it would be doing that.
> In an effort to fix that I copied the derby.jar (derby-10.4.2.0.jar) into
> the Hadoop directory, where I assume that the reducer would be able to find it.
> However I get exactly the same problem as before.
> Is there some particular place that I should put the derby.jar to make this
> problem go away? Is there anything else that I can try?
>
> Peter Marron
>
> From: Dean Wampler [mailto:dean.wampler@thinkbiganalytics.com]
> Sent: 01 November 2012 13:02
> To: user@hive.apache.org
> Subject: Re: Creating Indexes
>
> It looks like you're using Derby with a real cluster, not just a single
> machine in local or pseudo-distributed mode. I haven't tried this myself,
> but the derby jar is probably not on the machine that ran the reducer task
> that failed.
>
> dean
>
> On Thu, Nov 1, 2012 at 4:31 AM, Peter Marron <Peter.Marron@trilliumsoftware.com> wrote:
>
> Hi Shreepadma,
>
> I agree that the error looks odd. However I can't believe that I would have
> got this far with Hive if there was no derby jar. Nevertheless I checked.
> Here is a directory listing of the Hive install:
>  [snip]



-- 
Dean Wampler, Ph.D.
thinkbiganalytics.com
+1-312-339-1330

RE: Creating Indexes

Posted by Peter Marron <Pe...@trilliumsoftware.com>.
Hi Dean,

I'm running everything on a single physical machine in pseudo-distributed mode.

Well it certainly looks like the reducer is looking for a derby.jar, although I must
confess I don't really understand why it would be doing that.
In an effort to fix that I copied the derby.jar (derby-10.4.2.0.jar) into the
Hadoop directory, where I assume that the reducer would be able to find it.
However I get exactly the same problem as before.
Is there some particular place that I should put the derby.jar to make this
problem go away? Is there anything else that I can try?

Peter Marron

From: Dean Wampler [mailto:dean.wampler@thinkbiganalytics.com]
Sent: 01 November 2012 13:02
To: user@hive.apache.org
Subject: Re: Creating Indexes

It looks like you're using Derby with a real cluster, not just a single machine in local or pseudo-distributed mode. I haven't tried this myself, but the derby jar is probably not on the machine that ran the reducer task that failed.

dean
On Thu, Nov 1, 2012 at 4:31 AM, Peter Marron <Pe...@trilliumsoftware.com> wrote:
Hi Shreepadma,

I agree that the error looks odd. However I can't believe that I would have
got this far with Hive if there was no derby jar. Nevertheless I checked.
Here is a directory listing of the Hive install:
 [snip]


Re: Creating Indexes

Posted by Dean Wampler <de...@thinkbiganalytics.com>.
It looks like you're using Derby with a real cluster, not just a single
machine in local or pseudo-distributed mode. I haven't tried this myself,
but the derby jar is probably not on the machine that ran the reducer task
that failed.

dean

On Thu, Nov 1, 2012 at 4:31 AM, Peter Marron <Peter.Marron@trilliumsoftware.com> wrote:

> Hi Shreepadma,
>
> I agree that the error looks odd. However I can't believe that I would have
> got this far with Hive if there was no derby jar. Nevertheless I checked.
> Here is a directory listing of the Hive install:
>  [snip]


-- 
Dean Wampler, Ph.D.
thinkbiganalytics.com
+1-312-339-1330

RE: Creating Indexes

Posted by Peter Marron <Pe...@trilliumsoftware.com>.
Hi Shreepadma,

I agree that the error looks odd. However I can't believe that I would have
got this far with Hive if there was no derby jar. Nevertheless I checked.
Here is a directory listing of the Hive install:

pmarron@pmarron-ubuntu:/data/hive/lib$ ls
ant-contrib-1.0b3.jar          commons-pool-1.5.4.jar                hive-common-0.8.1.jar         hive-shims-0.8.1.jar  mockito-all-1.8.2.jar
antlr-2.7.7.jar                datanucleus-connectionpool-2.0.3.jar  hive-contrib-0.8.1.jar        javaewah-0.3.jar      php
antlr-3.0.1.jar                datanucleus-core-2.0.3.jar            hive_contrib.jar              jdo2-api-2.3-ec.jar   py
antlr-runtime-3.0.1.jar        datanucleus-enhancer-2.0.3.jar        hive-exec-0.8.1.jar           jline-0.9.94.jar      slf4j-api-1.6.1.jar
asm-3.1.jar                    datanucleus-rdbms-2.0.3.jar           hive-hbase-handler-0.8.1.jar  json-20090211.jar     slf4j-log4j12-1.6.1.jar
commons-cli-1.2.jar            derby-10.4.2.0.jar                    hive-hwi-0.8.1.jar            junit-4.10.jar        stringtemplate-3.1-b1.jar
commons-codec-1.3.jar          guava-r06.jar                         hive-hwi-0.8.1.war            libfb303-0.7.0.jar    velocity-1.5.jar
commons-collections-3.2.1.jar  hbase-0.89.0-SNAPSHOT.jar             hive-jdbc-0.8.1.jar           libfb303.jar          zookeeper-3.3.1.jar
commons-dbcp-1.4.jar           hbase-0.89.0-SNAPSHOT-tests.jar       hive-metastore-0.8.1.jar      libthrift-0.7.0.jar
commons-lang-2.4.jar           hive-anttasks-0.8.1.jar               hive-pdk-0.8.1.jar            libthrift.jar
commons-logging-1.0.4.jar      hive-builtins-0.8.1.jar               hive-serde-0.8.1.jar          log4j-1.2.15.jar
commons-logging-api-1.0.4.jar  hive-cli-0.8.1.jar                    hive-service-0.8.1.jar        log4j-1.2.16.jar

Also I found a derby.log in my home directory which I have attached.

Regards,

Z

From: Shreepadma Venugopalan [mailto:shreepadma@cloudera.com]
Sent: 31 October 2012 21:58
To: user@hive.apache.org
Subject: Re: Creating Indexes

Hi Peter,

From the execution log,

java.lang.ClassNotFoundException: org.apache.derby.jdbc.EmbeddedDriver
          at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
          at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
          at java.security.AccessController.doPrivileged(Native Method)
          at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
          at java.lang.ClassLoader.loadClass(ClassLoader.java:423)
          at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
          at java.lang.ClassLoader.loadClass(ClassLoader.java:356)
          at java.lang.Class.forName0(Native Method)
          at java.lang.Class.forName(Class.java:186)
          at org.apache.hadoop.hive.ql.stats.jdbc.JDBCStatsPublisher.connect(JDBCStatsPublisher.java:68)
          at org.apache.hadoop.hive.ql.exec.FileSinkOperator.publishStats(FileSinkOperator.java:778)
          at org.apache.hadoop.hive.ql.exec.FileSinkOperator.closeOp(FileSinkOperator.java:723)
          at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:557)
          at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:566)
          at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:566)
          at org.apache.hadoop.hive.ql.exec.ExecReducer.close(ExecReducer.java:303)
          at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:529)
          at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:420)
          at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
          at java.security.AccessController.doPrivileged(Native Method)
          at javax.security.auth.Subject.doAs(Subject.java:415)
          at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
          at org.apache.hadoop.mapred.Child.main(Child.java:249)

It appears that the error is due to derby classes not being found. Can you check if the derby jars are present?

Thanks,
Shreepadma
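
(A quick, if rough, way to check from the same box, using the Hive CLI's shell escape; hadoop classpath prints the classpath the Hadoop scripts construct, which is a reasonable proxy for what the child tasks will see:

hive> !hadoop classpath;

If no derby jar appears in that output, the reducers won't find the embedded driver regardless of what sits in Hive's own lib directory.)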


On Wed, Oct 31, 2012 at 12:52 PM, Peter Marron <Pe...@trilliumsoftware.com> wrote:
Hi Shreepadma

Happy to attach the logs, not quite sure which one is going to
be most useful. Please find attached one which contained an
error of some sort. Not sure if it's related to the index error or not.
Found the file in this location:

/data/hadoop/logs/userlogs/job_201210311448_0001/attempt_201210311448_0001_r_000137_0/syslog

so maybe that will help you locate any other file that you might want to see.

Thanks for your efforts.

Peter Marron

From: Shreepadma Venugopalan [mailto:shreepadma@cloudera.com]
Sent: 31 October 2012 18:38
To: user@hive.apache.org
Subject: Re: Creating Indexes

Hi Peter,

Can you attach the execution logs? What is the exception that you see in the execution logs?

Thanks,
Shreepadma

On Wed, Oct 31, 2012 at 10:42 AM, Peter Marron <Pe...@trilliumsoftware.com> wrote:
Hi,

I am still having problems building my index.
In an attempt to find someone who can help me
I'll go through all the steps that I try.


1)      First I load my data into hive.

hive> LOAD DATA INPATH 'E3/score.csv' OVERWRITE INTO TABLE score;
Loading data to table default.score
Deleted hdfs://localhost/data/warehouse/score
OK
Time taken: 7.817 seconds


2)      Then I try to create the index

hive> CREATE INDEX bigIndex
    > ON TABLE score(Ath_Seq_Num)
    > AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler';
FAILED: Error in metadata: java.lang.RuntimeException: Please specify deferred rebuild using " WITH DEFERRED REBUILD ".
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask
hive>


3)      OK, so it suggests that I use "WITH DEFERRED REBUILD" and so I do
hive>
    >
    > CREATE INDEX bigIndex
    > ON TABLE score(Ath_Seq_Num)
    > AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'
    > WITH DEFERRED REBUILD;
OK
Time taken: 0.603 seconds
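
(A sanity check that fits between steps 3 and 4, using standard Hive 0.8 syntax, is to confirm that the index and its backing table were created before attempting the rebuild:

hive> SHOW FORMATTED INDEX ON score;
hive> DESCRIBE default__score_bigindex__;
)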


4)      Now, to create the index I assume that I use ALTER INDEX as follows:

hive>ALTER INDEX bigIndex ON score REBUILD;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 138
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapred.reduce.tasks=<number>
Starting Job = job_201210311448_0001, Tracking URL = http://localhost:50030/jobdetails.jsp?jobid=job_201210311448_0001
Kill Command = /data/hadoop-1.0.3/libexec/../bin/hadoop job  -Dmapred.job.tracker=localhost:8021 -kill job_201210311448_0001
Hadoop job information for Stage-1: number of mappers: 511; number of reducers: 138
2012-10-31 15:59:27,076 Stage-1 map = 0%,  reduce = 0%


5)      This all looks promising, and after increasing my heapsize to get the Map/Reduce to complete, I get this an hour later

2012-10-31 17:08:23,572 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 4135.47 sec
MapReduce Total cumulative CPU time: 0 days 1 hours 8 minutes 55 seconds 470 msec
Ended Job = job_201210311448_0001
Loading data to table default.default__score_bigindex__
Deleted hdfs://localhost/data/warehouse/default__score_bigindex__
Invalid alter operation: Unable to alter index.
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask

So what have I done wrong, and what am I to do to get this index to build successfully?

Any help appreciated.

Peter Marron

From: Peter Marron [mailto:Peter.Marron@trilliumsoftware.com]
Sent: 24 October 2012 13:27
To: user@hive.apache.org
Subject: RE: Indexes

Hi Shreepadma,

Thanks for this. Looks exactly like the information I need.
I was going to reply when I had tried it all out, but I'm having
problems creating the index at the moment (I'm getting an
OutOfMemoryError at the moment). So I thought that I had
better reply now to say thank you.

Peter Marron


From: Shreepadma Venugopalan [mailto:shreepadma@cloudera.com]
Sent: 23 October 2012 19:49
To: user@hive.apache.org
Subject: Re: Indexes

Hi Peter,

Indexing support was added to Hive in 0.7 and in 0.8 the query compiler was enhanced to optimized some class of queries (certain group bys and joins) using indexes. Assuming you are using the built in index handler you need to do the following _after_ you have created and rebuilt the index,

SET hive.index.compact.file='/tmp/index_result';
SET hive.input.format=org.apache.hadoop.hive.ql.index.compact.HiveCompactIndexInputFormat;

You will then notice speed up for a query of the form,

select count(*) from tab where indexed_col = some_val

Thanks,
Shreepadma
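
(To make those two SETs concrete: with the compact handler you first materialize the matching index entries into the location that hive.index.compact.file points at, then switch the input format. A sketch using the names from this thread, with 12345 standing in for a real Ath_Seq_Num value:

hive> INSERT OVERWRITE DIRECTORY '/tmp/index_result'
    > SELECT `_bucketname`, `_offsets`
    > FROM default__score_bigindex__
    > WHERE Ath_Seq_Num = 12345;
hive> SET hive.index.compact.file=/tmp/index_result;
hive> SET hive.input.format=org.apache.hadoop.hive.ql.index.compact.HiveCompactIndexInputFormat;
hive> SELECT COUNT(*) FROM score WHERE Ath_Seq_Num = 12345;

The `_bucketname` and `_offsets` columns are what the compact index handler stores alongside the indexed column.)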

On Tue, Oct 23, 2012 at 5:44 AM, Peter Marron <Pe...@trilliumsoftware.com> wrote:
Hi,

I'm very much a Hive newbie but I've been looking at HIVE-417 and this page in particular:
http://cwiki.apache.org/confluence/display/Hive/IndexDev
Using this information I've been able to create an index (using Hive 0.8.1)
and when I look at the contents it all looks very promising indeed.
However on the same page there's this comment:

"...This document currently only covers index creation and maintenance. A follow-on will explain how indexes are used to optimize queries (building on FilterPushdownDev<https://cwiki.apache.org/confluence/display/Hive/FilterPushdownDev>)...."

However I can't find the "follow-on" which tells me how to exploit the index that I've
created to "optimize" subsequent queries.
Now I've been told that I can create and use indexes with the current
release of Hive _without_ writing and developing any Java code of my own.
Is this true? If so, how?

Any help appreciated.

Peter Marron.




Re: Creating Indexes

Posted by Shreepadma Venugopalan <sh...@cloudera.com>.
Hi Peter,

From the execution log,

java.lang.ClassNotFoundException: org.apache.derby.jdbc.EmbeddedDriver
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:423)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
at java.lang.ClassLoader.loadClass(ClassLoader.java:356)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:186)
at
org.apache.hadoop.hive.ql.stats.jdbc.JDBCStatsPublisher.connect(JDBCStatsPublisher.java:68)
at
org.apache.hadoop.hive.ql.exec.FileSinkOperator.publishStats(FileSinkOperator.java:778)
at
org.apache.hadoop.hive.ql.exec.FileSinkOperator.closeOp(FileSinkOperator.java:723)
at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:557)
at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:566)
at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:566)
at org.apache.hadoop.hive.ql.exec.ExecReducer.close(ExecReducer.java:303)
at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:529)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:420)
at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
at org.apache.hadoop.mapred.Child.main(Child.java:249)

It appears that the error is due to derby classes not being found. Can you
check if the derby jars are present?

Thanks,
Shreepadma


On Wed, Oct 31, 2012 at 12:52 PM, Peter Marron <Peter.Marron@trilliumsoftware.com> wrote:

> Hi Shreepadma
>
> Happy to attach the logs, not quite sure which one is going to
> be most useful. Please find attached one which contained an
> error of some sort. Not sure if it's related to the index error or not.
> Found the file in this location:
>
> /data/hadoop/logs/userlogs/job_201210311448_0001/attempt_201210311448_0001_r_000137_0/syslog
>
> so maybe that will help you locate any other file that you might want to see.
>
> Thanks for your efforts.
>
> Peter Marron
>
> [snip]

RE: Creating Indexes

Posted by Peter Marron <Pe...@trilliumsoftware.com>.
Hi Shreepadma

Happy to attach the logs, not quite sure which one is going to
be most useful. Please find attached one which contained an
error of some sort. Not sure if it's related to the index error or not.
Found the file in this location:

/data/hadoop/logs/userlogs/job_201210311448_0001/attempt_201210311448_0001_r_000137_0/syslog

so maybe that will help you locate any other file that you might want to see.

Thanks for your efforts.

Peter Marron

From: Shreepadma Venugopalan [mailto:shreepadma@cloudera.com]
Sent: 31 October 2012 18:38
To: user@hive.apache.org
Subject: Re: Creating Indexes

Hi Peter,

Can you attach the execution logs? What is the exception that you see in the execution logs?

Thanks,
Shreepadma

On Wed, Oct 31, 2012 at 10:42 AM, Peter Marron <Pe...@trilliumsoftware.com> wrote:
Hi,

I am still having problems building my index.
In an attempt to find someone who can help me
I'll go through all the steps that I try.


1)      First I load my data into hive.

hive> LOAD DATA INPATH 'E3/score.csv' OVERWRITE INTO TABLE score;
Loading data to table default.score
Deleted hdfs://localhost/data/warehouse/score
OK
Time taken: 7.817 seconds


2)      Then I try to create the index

hive> CREATE INDEX bigIndex
    > ON TABLE score(Ath_Seq_Num)
    > AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler';
FAILED: Error in metadata: java.lang.RuntimeException: Please specify deferred rebuild using " WITH DEFERRED REBUILD ".
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask
hive>


3)      OK, so it suggests that I use "WITH DEFERRED REBUILD" and so I do
hive>
    >
    > CREATE INDEX bigIndex
    > ON TABLE score(Ath_Seq_Num)
    > AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'
    > WITH DEFERRED REBUILD;
OK
Time taken: 0.603 seconds


4)      Now, to create the index I assume that I use ALTER INDEX as follows:

hive>ALTER INDEX bigIndex ON score REBUILD;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 138
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapred.reduce.tasks=<number>
Starting Job = job_201210311448_0001, Tracking URL = http://localhost:50030/jobdetails.jsp?jobid=job_201210311448_0001
Kill Command = /data/hadoop-1.0.3/libexec/../bin/hadoop job  -Dmapred.job.tracker=localhost:8021 -kill job_201210311448_0001
Hadoop job information for Stage-1: number of mappers: 511; number of reducers: 138
2012-10-31 15:59:27,076 Stage-1 map = 0%,  reduce = 0%


5)      This all looks promising, and after increasing my heapsize to get the Map/Reduce to complete, I get this an hour later

2012-10-31 17:08:23,572 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 4135.47 sec
MapReduce Total cumulative CPU time: 0 days 1 hours 8 minutes 55 seconds 470 msec
Ended Job = job_201210311448_0001
Loading data to table default.default__score_bigindex__
Deleted hdfs://localhost/data/warehouse/default__score_bigindex__
Invalid alter operation: Unable to alter index.
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask

So what have I done wrong, and what am I to do to get this index to build successfully?

Any help appreciated.

Peter Marron

From: Peter Marron [mailto:Peter.Marron@trilliumsoftware.com]
Sent: 24 October 2012 13:27
To: user@hive.apache.org
Subject: RE: Indexes

Hi Shreepadma,

Thanks for this. Looks exactly like the information I need.
I was going to reply when I had tried it all out, but I'm having
problems creating the index at the moment (I'm getting an
OutOfMemoryError at the moment). So I thought that I had
better reply now to say thank you.

Peter Marron


From: Shreepadma Venugopalan [mailto:shreepadma@cloudera.com]
Sent: 23 October 2012 19:49
To: user@hive.apache.org
Subject: Re: Indexes

Hi Peter,

Indexing support was added to Hive in 0.7 and in 0.8 the query compiler was enhanced to optimized some class of queries (certain group bys and joins) using indexes. Assuming you are using the built in index handler you need to do the following _after_ you have created and rebuilt the index,

SET hive.index.compact.file='/tmp/index_result';
SET hive.input.format=org.apache.hadoop.hive.ql.index.compact.HiveCompactIndexInputFormat;

You will then notice speed up for a query of the form,

select count(*) from tab where indexed_col = some_val

Thanks,
Shreepadma

On Tue, Oct 23, 2012 at 5:44 AM, Peter Marron <Pe...@trilliumsoftware.com> wrote:
Hi,

I'm very much a Hive newbie but I've been looking at HIVE-417 and this page in particular:
http://cwiki.apache.org/confluence/display/Hive/IndexDev
Using this information I've been able to create an index (using Hive 0.8.1)
and when I look at the contents it all looks very promising indeed.
However on the same page there's this comment:

"...This document currently only covers index creation and maintenance. A follow-on will explain how indexes are used to optimize queries (building on FilterPushdownDev<https://cwiki.apache.org/confluence/display/Hive/FilterPushdownDev>)...."

However I can't find the "follow-on" which tells me how to exploit the index that I've
created to "optimize" subsequent queries.
Now I've been told that I can create and use indexes with the current
release of Hive _without_ writing and developing any Java code of my own.
Is this true? If so, how?

Any help appreciated.

Peter Marron.



Re: Creating Indexes

Posted by Shreepadma Venugopalan <sh...@cloudera.com>.
Hi Peter,

Can you attach the execution logs? What is the exception that you see in
the execution logs?

Thanks,
Shreepadma

On Wed, Oct 31, 2012 at 10:42 AM, Peter Marron <Peter.Marron@trilliumsoftware.com> wrote:

> Hi,
>
> I am still having problems building my index.
> In an attempt to find someone who can help me
> I'll go through all the steps that I try.
> [snip]