Posted to user@hive.apache.org by Brian Salazar <br...@gmail.com> on 2011/02/04 20:35:54 UTC

Hive bulk load into HBase

I have been using the Bulk Load example here:
http://wiki.apache.org/hadoop/Hive/HBaseBulkLoad

I am having an issue bulk-loading 1 million records into HBase via Hive
on a 6-node cluster.
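
For context, the wiki recipe has, roughly, four steps: sample row keys
to pick range-partition boundaries, write those boundaries to a
partition file, generate sorted HFiles through HiveHFileOutputFormat,
and finally import the HFiles into HBase with the bin/loadtable.rb
script that ships with HBase 0.20. I am failing at the very first step,
the key-sampling query shown below. Per the wiki, the later
HFile-generation table looks something like the following; the table
and column names, the output path, and the "cf" column family here are
placeholders rather than my real ones, so treat this as a sketch only:

create table cdata_hfiles(uid string, bread_crumb_csv string)
stored as
  INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.hbase.HiveHFileOutputFormat'
tblproperties ('hfile.family.path' = '/tmp/cdata_sort/cf');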

Hive 0.6.0 (built from source to get UDFRowSequence)
Hadoop 0.20.2
HBase 0.20.6
Zookeeper 3.3.2

hive> desc cdata_dump;
OK
uid     string
retail_cat_name1        string
retail_cat_name2        string
retail_cat_name3        string
bread_crumb_csv string
Time taken: 4.194 seconds
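
For completeness, cdata_dump is a plain staging table in the default
warehouse (the log below resolves it to
hdfs://hadoop-1:54310/user/hive/warehouse/cdata_dump); modulo the row
format details, its DDL boils down to:

create table cdata_dump (
  uid string,
  retail_cat_name1 string,
  retail_cat_name2 string,
  retail_cat_name3 string,
  bread_crumb_csv string
)
stored as textfile;  -- assuming the default text storage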

Now my issue:

hive> set mapred.reduce.tasks=1;
hive> create temporary function row_sequence as
    > 'org.apache.hadoop.hive.contrib.udf.UDFRowSequence';
OK
Time taken: 0.0080 seconds

hive> select uid from
    > (select uid
    > from cdata_dump
    > tablesample(bucket 1 out of 1000 on uid) s
    > order by uid
    > limit 1000) x
    > where (row_sequence() % 100000)=0
    > order by uid
    > limit 9;
11/02/04 19:25:21 INFO parse.ParseDriver: Parsing command: select uid from
(select uid
from cdata_dump
tablesample(bucket 1 out of 1000 on uid) s
order by uid
limit 1000) x
where (row_sequence() % 100000)=0
order by uid
limit 9
11/02/04 19:25:21 INFO parse.ParseDriver: Parse Completed
11/02/04 19:25:21 INFO parse.SemanticAnalyzer: Starting Semantic Analysis
11/02/04 19:25:21 INFO parse.SemanticAnalyzer: Completed phase 1 of Semantic
Analysis
11/02/04 19:25:21 INFO parse.SemanticAnalyzer: Get metadata for source
tables
11/02/04 19:25:21 INFO parse.SemanticAnalyzer: Get metadata for subqueries
11/02/04 19:25:21 INFO parse.SemanticAnalyzer: Get metadata for source
tables
11/02/04 19:25:21 INFO metastore.HiveMetaStore: 0: Opening raw store with
implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
11/02/04 19:25:21 INFO metastore.ObjectStore: ObjectStore, initialize called
11/02/04 19:25:22 ERROR DataNucleus.Plugin: Bundle "org.eclipse.jdt.core"
requires "org.eclipse.core.resources" but it cannot be resolved.
11/02/04 19:25:22 ERROR DataNucleus.Plugin: Bundle "org.eclipse.jdt.core"
requires "org.eclipse.core.runtime" but it cannot be resolved.
11/02/04 19:25:22 ERROR DataNucleus.Plugin: Bundle "org.eclipse.jdt.core"
requires "org.eclipse.text" but it cannot be resolved.
11/02/04 19:25:23 INFO metastore.ObjectStore: Setting MetaStore object pin
classes with
hive.metastore.cache.pinobjtypes="Table,StorageDescriptor,SerDeInfo,Partition,Database,Type,FieldSchema,Order"
11/02/04 19:25:23 INFO metastore.ObjectStore: Initialized ObjectStore
11/02/04 19:25:24 INFO metastore.HiveMetaStore: 0: get_table : db=default
tbl=cdata_dump
11/02/04 19:25:25 INFO hive.log: DDL: struct cdata_dump { string uid, string
retail_cat_name1, string retail_cat_name2, string retail_cat_name3, string
bread_crumb_csv}
11/02/04 19:25:25 INFO parse.SemanticAnalyzer: Get metadata for subqueries
11/02/04 19:25:25 INFO parse.SemanticAnalyzer: Get metadata for destination
tables
11/02/04 19:25:25 INFO parse.SemanticAnalyzer: Get metadata for destination
tables
11/02/04 19:25:25 INFO parse.SemanticAnalyzer: Completed getting MetaData in
Semantic Analysis
11/02/04 19:25:25 INFO parse.SemanticAnalyzer: Need sample filter
11/02/04 19:25:25 INFO parse.SemanticAnalyzer: hashfnExpr = class
org.apache.hadoop.hive.ql.udf.generic.GenericUDFHash(Column[uid]()
11/02/04 19:25:25 INFO parse.SemanticAnalyzer: andExpr = class
org.apache.hadoop.hive.ql.udf.generic.GenericUDFBridge(class
org.apache.hadoop.hive.ql.udf.generic.GenericUDFHash(Column[uid](), Const
int 2147483647()
11/02/04 19:25:25 INFO parse.SemanticAnalyzer: modExpr = class
org.apache.hadoop.hive.ql.udf.generic.GenericUDFBridge(class
org.apache.hadoop.hive.ql.udf.generic.GenericUDFBridge(class
org.apache.hadoop.hive.ql.udf.generic.GenericUDFHash(Column[uid](), Const
int 2147483647(), Const int 1000()
11/02/04 19:25:25 INFO parse.SemanticAnalyzer: numeratorExpr = Const int 0
11/02/04 19:25:25 INFO parse.SemanticAnalyzer: equalsExpr = class
org.apache.hadoop.hive.ql.udf.generic.GenericUDFOPEqual(class
org.apache.hadoop.hive.ql.udf.generic.GenericUDFBridge(class
org.apache.hadoop.hive.ql.udf.generic.GenericUDFBridge(class
org.apache.hadoop.hive.ql.udf.generic.GenericUDFHash(Column[uid](), Const
int 2147483647(), Const int 1000(), Const int 0()
11/02/04 19:25:25 INFO ppd.OpProcFactory: Processing for FS(11)
11/02/04 19:25:25 INFO ppd.OpProcFactory: Processing for LIM(10)
11/02/04 19:25:25 INFO ppd.OpProcFactory: Processing for OP(9)
11/02/04 19:25:25 INFO ppd.OpProcFactory: Processing for RS(8)
11/02/04 19:25:25 INFO ppd.OpProcFactory: Processing for SEL(7)
11/02/04 19:25:25 INFO ppd.OpProcFactory: Processing for FIL(6)
11/02/04 19:25:25 INFO ppd.OpProcFactory: Processing for LIM(5)
11/02/04 19:25:25 INFO ppd.OpProcFactory: Processing for OP(4)
11/02/04 19:25:25 INFO ppd.OpProcFactory: Processing for RS(3)
11/02/04 19:25:25 INFO ppd.OpProcFactory: Processing for SEL(2)
11/02/04 19:25:25 INFO ppd.OpProcFactory: Processing for FIL(1)
11/02/04 19:25:25 INFO ppd.OpProcFactory: Pushdown Predicates of FIL For
Alias : s
11/02/04 19:25:25 INFO ppd.OpProcFactory:       (((hash(uid) & 2147483647) %
1000) = 0)
11/02/04 19:25:25 INFO ppd.OpProcFactory: Processing for TS(0)
11/02/04 19:25:25 INFO ppd.OpProcFactory: Pushdown Predicates of TS For
Alias : s
11/02/04 19:25:25 INFO ppd.OpProcFactory:       (((hash(uid) & 2147483647) %
1000) = 0)
11/02/04 19:25:25 INFO hive.log: DDL: struct cdata_dump { string uid, string
retail_cat_name1, string retail_cat_name2, string retail_cat_name3, string
bread_crumb_csv}
[... same DDL line repeated five more times ...]
11/02/04 19:25:25 INFO parse.SemanticAnalyzer: Completed plan generation
11/02/04 19:25:25 INFO ql.Driver: Semantic Analysis Completed
11/02/04 19:25:25 INFO ql.Driver: Returning Hive schema:
Schema(fieldSchemas:[FieldSchema(name:uid, type:string, comment:null)],
properties:null)
11/02/04 19:25:25 INFO ql.Driver: Starting command: select uid from
(select uid
from cdata_dump
tablesample(bucket 1 out of 1000 on uid) s
order by uid
limit 1000) x
where (row_sequence() % 100000)=0
order by uid
limit 9
Total MapReduce jobs = 2
Launching Job 1 out of 2
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapred.reduce.tasks=<number>
11/02/04 19:25:26 INFO exec.MapRedTask: Using org.apache.hadoop.hive.ql.io.HiveInputFormat
11/02/04 19:25:26 INFO exec.MapRedTask: adding libjars: file:///home/hadoop/hive/build/dist/lib/hive_hbase-handler.jar,file:///usr/local/hadoop-0.20.2/zookeeper-3.3.2/zookeeper-3.3.2.jar,file:///usr/local/hadoop-0.20.2/hbase-0.20.6/hbase-0.20.6.jar
11/02/04 19:25:26 INFO exec.MapRedTask: Processing alias x:s
11/02/04 19:25:26 INFO exec.MapRedTask: Adding input file hdfs://hadoop-1:54310/user/hive/warehouse/cdata_dump
11/02/04 19:25:26 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
11/02/04 19:25:26 INFO mapred.FileInputFormat: Total input paths to process : 1
Starting Job = job_201102040059_0016, Tracking URL =
http://hadoop-1:50030/jobdetails.jsp?jobid=job_201102040059_0016
Kill Command = /usr/local/hadoop-0.20.2/bin/../bin/hadoop job
 -Dmapred.job.tracker=hadoop-1:54311 -kill job_201102040059_0016
2011-02-04 19:25:32,266 Stage-1 map = 0%,  reduce = 0%
2011-02-04 19:25:38,304 Stage-1 map = 100%,  reduce = 0%
2011-02-04 19:25:47,354 Stage-1 map = 100%,  reduce = 33%
2011-02-04 19:25:50,377 Stage-1 map = 100%,  reduce = 0%
2011-02-04 19:25:59,429 Stage-1 map = 100%,  reduce = 33%
2011-02-04 19:26:02,445 Stage-1 map = 100%,  reduce = 0%
2011-02-04 19:26:11,484 Stage-1 map = 100%,  reduce = 33%
2011-02-04 19:26:14,498 Stage-1 map = 100%,  reduce = 0%
2011-02-04 19:26:24,537 Stage-1 map = 100%,  reduce = 33%
2011-02-04 19:26:27,549 Stage-1 map = 100%,  reduce = 0%
2011-02-04 19:26:30,563 Stage-1 map = 100%,  reduce = 100%
Ended Job = job_201102040059_0016 with errors
FAILED: Execution Error, return code 2 from
org.apache.hadoop.hive.ql.exec.MapRedTask

I am getting errors like this in the task log:


2011-02-04 19:25:44,460 WARN org.apache.hadoop.mapred.TaskTracker:
Error running child
java.lang.RuntimeException:
org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error
while processing row (tag=0)
{"key":{"reducesinkkey0":""},"value":{"_col0":""},"alias":0}
	at org.apache.hadoop.hive.ql.exec.ExecReducer.reduce(ExecReducer.java:268)
	at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:463)
	at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:411)
	at org.apache.hadoop.mapred.Child.main(Child.java:170)
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Hive
Runtime Error while processing row (tag=0)
{"key":{"reducesinkkey0":""},"value":{"_col0":""},"alias":0}
	at org.apache.hadoop.hive.ql.exec.ExecReducer.reduce(ExecReducer.java:256)
	... 3 more
Caused by: java.lang.RuntimeException: java.lang.NullPointerException
	at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:115)
	at org.apache.hadoop.hive.ql.udf.generic.GenericUDFBridge.initialize(GenericUDFBridge.java:126)
	at org.apache.hadoop.hive.ql.exec.ExprNodeGenericFuncEvaluator.initialize(ExprNodeGenericFuncEvaluator.java:80)
	at org.apache.hadoop.hive.ql.exec.ExprNodeGenericFuncEvaluator.initialize(ExprNodeGenericFuncEvaluator.java:77)
	at org.apache.hadoop.hive.ql.exec.ExprNodeGenericFuncEvaluator.initialize(ExprNodeGenericFuncEvaluator.java:77)
	at org.apache.hadoop.hive.ql.exec.FilterOperator.processOp(FilterOperator.java:80)
	at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:471)
	at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:744)
	at org.apache.hadoop.hive.ql.exec.LimitOperator.processOp(LimitOperator.java:47)
	at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:471)
	at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:744)
	at org.apache.hadoop.hive.ql.exec.ExtractOperator.processOp(ExtractOperator.java:45)
	at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:471)
	at org.apache.hadoop.hive.ql.exec.ExecReducer.reduce(ExecReducer.java:247)
	... 3 more
Caused by: java.lang.NullPointerException
	at java.util.concurrent.ConcurrentHashMap.get(ConcurrentHashMap.java:768)
	at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:107)
	... 16 more
2011-02-04 19:25:44,463 INFO org.apache.hadoop.mapred.TaskRunner:
Runnning cleanup for the task
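
One thing I notice while staring at the trace: the NullPointerException
comes out of ReflectionUtils.newInstance while GenericUDFBridge is
initializing the UDF on the reduce side, i.e. the reducer apparently
cannot resolve the row_sequence class. And the "adding libjars" line in
the console output above lists only hive_hbase-handler.jar,
zookeeper-3.3.2.jar and hbase-0.20.6.jar -- nothing that contains
org.apache.hadoop.hive.contrib.udf.UDFRowSequence. So my working guess
is that the class is visible to the CLI (it came along with my source
build) but never ships to the task JVMs. If that is the problem,
explicitly adding the contrib jar before registering the function
should ship it with the job; the jar path below is from my build tree
and the jar name is my assumption, so treat it as a sketch:

add jar /home/hadoop/hive/build/dist/lib/hive_contrib.jar;  -- jar name assumed
create temporary function row_sequence as
  'org.apache.hadoop.hive.contrib.udf.UDFRowSequence';

(Separately, I realize the inner "limit 1000" is smaller than the
modulus 100000, so the row_sequence filter could never match and the
sampler would return zero rows even if it ran -- but that by itself
should give an empty result, not this exception.)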


Any ideas?


Thanks in advance!

- Brian