Posted to dev@hbase.apache.org by Asher Devuyst <as...@gmail.com> on 2015/03/23 18:25:33 UTC

Bulk Loading

Running HBase 0.98.4, Hadoop 2.6.0

I have some sequence files that were stored in HDFS, let's say in
/user/me/seq

I have a job that reads the sequence files and writes the KeyValues out through
HFileOutputFormat to an output directory, let's say /user/me/hfiles. The job is
configured for the bulk load into HBase with:

HFileOutputFormat.configureIncrementalLoad(job, hTable);

FileOutputFormat.setOutputPath(job, outputPath);
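
For reference, the driver setup looks roughly like the sketch below (a minimal
version with exception handling omitted; SeqToHFileMapper is just a placeholder
name for my actual mapper, which emits ImmutableBytesWritable row keys and
KeyValues):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Inside the driver's main():
Configuration conf = HBaseConfiguration.create();
Job job = Job.getInstance(conf, "seq-to-hfiles");

job.setJarByClass(SeqToHFileMapper.class);      // placeholder mapper class
job.setMapperClass(SeqToHFileMapper.class);
job.setMapOutputKeyClass(ImmutableBytesWritable.class);
job.setMapOutputValueClass(KeyValue.class);
job.setInputFormatClass(SequenceFileInputFormat.class);

FileInputFormat.addInputPath(job, new Path("/user/me/seq"));

// Sets the reducer, TotalOrderPartitioner, and output format based on the
// table's current region boundaries.
HTable hTable = new HTable(conf, "mytable");
HFileOutputFormat.configureIncrementalLoad(job, hTable);

Path outputPath = new Path("/user/me/hfiles");
FileOutputFormat.setOutputPath(job, outputPath);

boolean ok = job.waitForCompletion(true);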

The job runs and reports success, both for the job as a whole and for all of
its tasks.

Now when I check the output path: hdfs dfs -du -s -h hfiles/*

In there I have a directory called mytable that contains the HFiles, and it
comes to 168.3 G.

When I check /apps/hbase/data/data/default/mytable in HBase, it has a size of
only 14.5 G.

Even accounting for Snappy compression on the table while the generated HFiles
are uncompressed, a difference of roughly 10x seems questionable.

My question is: the job and all of its tasks reported success, yet something is
not adding up. Should the job's output directory have been the HBase table
directory instead, or did the import fail silently? Are there any clues I
should look for in the logs?

I did a scan of the table, and data that should be there is not returned.
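
The scan was just a plain client-side check, roughly like this sketch (the row
key below is a placeholder for one I know was in the sequence files):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

Configuration conf = HBaseConfiguration.create();
HTable table = new HTable(conf, "mytable");
try {
    Scan scan = new Scan();
    // Placeholder: start at a row key that should have been loaded.
    scan.setStartRow(Bytes.toBytes("some-known-row-key"));
    scan.setCaching(100);
    ResultScanner scanner = table.getScanner(scan);
    try {
        for (Result r : scanner) {
            System.out.println(Bytes.toString(r.getRow()));
        }
    } finally {
        scanner.close();
    }
} finally {
    table.close();
}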

The only hint of a problem is that the counters for the reduce tasks include
one called IO_ERROR with a value > 0 in every task. It seems to me that this
should fail the task rather than fail silently. Any idea what is going on here?

Thanks,

Asher