Posted to user@hive.apache.org by Bruce Bian <we...@gmail.com> on 2012/02/10 03:39:52 UTC

Re: HFileInputFormat for MapReduce

I also ran into this issue when comparing Hive+HBase with Hive+HDFS
(native Hive tables). After some tuning (ensuring data locality, using
scanner caching, setting an appropriate number of mappers per node, etc.),
Hive+HBase is still around 4-5X slower.
I guess the two main reasons are:
1) HFile repeats the key for each K/V pair, so it is more redundant than the
sequence files used by native Hive tables. (In my case, the same table is
~5X larger in HBase than in Hive flat files.)

2) The HBase API adds an extra layer of RPC. Tatsuya ran a test that reads
from HDFS directly and reports it to be ~2.5X faster
(https://github.com/tatsuya6502/hbase-mr-pof). So implementing an
HFileInputFormat looks promising if the pitfalls mentioned are tolerable.
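
To put rough numbers on reason 1, here is a minimal sketch in Python. It
assumes the uncompressed KeyValue layout (4-byte key length, 4-byte value
length, 2-byte row length, row, 1-byte family length, family, qualifier,
8-byte timestamp, 1-byte type, value; no tags, no block compression) and
compares it with a simplified delimited flat-file row. The row key, family
and column names are made up for illustration, and the flat-file model is a
simplification that ignores sequence file headers and sync markers.

```python
def keyvalue_size(row, family, qualifier, value):
    """Approximate serialized size in bytes of one HBase KeyValue.

    Assumed layout (no tags, no compression):
      4B key length + 4B value length
      key = 2B row length + row + 1B family length + family
            + qualifier + 8B timestamp + 1B type
      value
    """
    key_len = 2 + len(row) + 1 + len(family) + len(qualifier) + 8 + 1
    return 4 + 4 + key_len + len(value)


def hbase_row_size(row, family, columns):
    """One KeyValue per column, so the row key and family repeat every time."""
    return sum(keyvalue_size(row, family, q, v) for q, v in columns.items())


def flat_row_size(row, columns):
    """Simplified flat-file row: the key once, then the values,
    with one delimiter (or newline) byte after each field."""
    fields = [row] + list(columns.values())
    return sum(len(f) for f in fields) + len(fields)


# Hypothetical sample row, not taken from any real table.
row = b"user00000001"
family = b"cf"
columns = {b"name": b"alice", b"age": b"30", b"city": b"tokyo"}

hbase_bytes = hbase_row_size(row, family, columns)
flat_bytes = flat_row_size(row, columns)
print(hbase_bytes, flat_bytes, hbase_bytes / flat_bytes)
```

For this made-up row the sketch gives 125 bytes as KeyValues versus 28 bytes
as a flat row, roughly 4.5X. That is not a measurement; the real ratio
depends on key length, value size and compression settings.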

For now, though, we are taking the approach of periodically exporting HBase
data to HDFS, as we need good performance for both random reads/writes and
analysis jobs.