Posted to user@hbase.apache.org by Robert James <sr...@gmail.com> on 2016/07/04 13:49:47 UTC

Understanding HBase random reads

I'd like to understand HBase block reads better.  Assume my HBase
block is 64KB and my HDFS block is 64MB.

I've read that HBase can just do a random read of the 64KB block,
without reading the 64MB HDFS block.  Given that HDFS doesn't support
random reads within a block, how is that possible? Is this only true
if the HDFS block is cached (either mem or disk, but outside of HDFS)?
Or does HBase somehow short circuit and go directly to OS, bypassing
HDFS because it knows HDFS internals?

Depending on the above: Aside from HBase block compression, should I
use HDFS block compression? If HDFS compression prevents HBase from
doing a random read, I most certainly do _not_ want to use it.  But if
HBase can't do a random read to HDFS, then I want to use HDFS block
compression, because you can compress a 64 MB block much better than a
64 KB block.

Re: Understanding HBase random reads

Posted by Stack <st...@duboce.net>.
On Mon, Jul 4, 2016 at 6:49 AM, Robert James <sr...@gmail.com> wrote:

> I'd like to understand HBase block reads better.  Assume my HBase
> block is 64KB and my HDFS block is 64MB.
>
> I've read that HBase can just do a random read of the 64KB block,
> without reading the 64MB HDFS block.



That's right.



> Given that HDFS doesn't support
> random reads within a block, how is that possible?



It does support reading at an explicit offset. See [1] and the pread method
that follows.
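Something like the following, done with the public FileSystem API. This is an untested sketch: the path, offset, and sizes are made up for illustration; in real HBase the offset and on-disk size of the 64KB block come from the HFile's block index.

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FSDataInputStream;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class PreadSketch {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      // Hypothetical HFile path; HBase knows the real one.
      Path hfile = new Path("/hbase/data/default/t1/r1/cf/somehfile");
      long blockOffset = 40L * 1024 * 1024; // 40MB into the file
      int blockSize = 64 * 1024;            // one 64KB HBase block
      byte[] buf = new byte[blockSize];

      try (FileSystem fs = FileSystem.get(conf);
           FSDataInputStream in = fs.open(hfile)) {
        // Positioned read ("pread"): asks the DataNode for just these
        // bytes; the surrounding 64MB HDFS block is never streamed.
        in.readFully(blockOffset, buf, 0, blockSize);
      }
    }
  }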



> Or does HBase somehow short circuit and go directly to OS, bypassing
> HDFS because it knows HDFS internals?
>
>
Yes, there is also a 'short circuit' read facility that makes the read
less costly when the block is local [2].
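Enabling it on the client side looks roughly like this (the two keys are
from the doc in [2]; the socket path is the doc's example and must match
what the DataNode is configured with):

  import org.apache.hadoop.conf.Configuration;

  public class ShortCircuitConf {
    public static Configuration clientConf() {
      Configuration conf = new Configuration();
      // Read local blocks directly from disk, bypassing the DataNode.
      conf.setBoolean("dfs.client.read.shortcircuit", true);
      // Unix domain socket shared with the DataNode.
      conf.set("dfs.domain.socket.path", "/var/lib/hadoop-hdfs/dn_socket");
      return conf;
    }
  }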



> Depending on the above: Aside from HBase block compression, should I
> use HDFS block compression? If HDFS compression prevents HBase from
> doing a random read, I most certainly do _not_ want to use it.  But if
> HBase can't do a random read to HDFS, then I want to use HDFS block
> compression, because you can compress a 64 MB block much better than a
> 64 KB block.
>

I've not played with it, but my guess is that HDFS compression would be
transparent to HBase, though seeking to a particular offset would then
require decompressing all of the HDFS block up to that read point.
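To make that cost concrete, here is a toy illustration using plain
java.util.zip rather than HDFS (the file name is made up): skipping to
an offset in a gzip stream still inflates every byte before it.

  import java.io.FileInputStream;
  import java.io.InputStream;
  import java.util.zip.GZIPInputStream;

  public class GzipSeekCost {
    public static void main(String[] args) throws Exception {
      long target = 40L * 1024 * 1024; // uncompressed offset we want
      try (InputStream in =
             new GZIPInputStream(new FileInputStream("block.gz"))) {
        long skipped = 0;
        while (skipped < target) {
          // skip() on a compressed stream decompresses and discards
          // everything up to the target -- there is no random access.
          long n = in.skip(target - skipped);
          if (n <= 0) break;
          skipped += n;
        }
        // ...only now can we read the 64KB we actually wanted.
      }
    }
  }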

You could enable HBase compression; the HBase blocks will then be compressed.
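For example, with the admin API (untested sketch; table and family names
are hypothetical):

  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.HColumnDescriptor;
  import org.apache.hadoop.hbase.TableName;
  import org.apache.hadoop.hbase.client.Admin;
  import org.apache.hadoop.hbase.client.Connection;
  import org.apache.hadoop.hbase.client.ConnectionFactory;
  import org.apache.hadoop.hbase.io.compress.Compression;

  public class EnableCompression {
    public static void main(String[] args) throws Exception {
      try (Connection conn =
             ConnectionFactory.createConnection(HBaseConfiguration.create());
           Admin admin = conn.getAdmin()) {
        HColumnDescriptor cf = new HColumnDescriptor("cf");
        // Each 64KB HBase block is compressed on its own at
        // flush/compaction time, so random reads stay cheap.
        cf.setCompressionType(Compression.Algorithm.GZ);
        admin.modifyColumn(TableName.valueOf("t1"), cf);
      }
    }
  }

The shell equivalent would be along the lines of:
alter 't1', {NAME => 'cf', COMPRESSION => 'GZ'}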

Regarding 'much better' compression, which compressor are you thinking
of? When I looked last, a long time ago admittedly, the likes of gzip
worked on chunks considerably smaller than an HDFS block.

Thanks,
St.Ack


1. http://grepcode.com/file/repo1.maven.org/maven2/org.apache.hadoop/hadoop-hdfs/2.7.1/org/apache/hadoop/hdfs/DFSInputStream.java#DFSInputStream.read%28long%2Cbyte%5B%5D%2Cint%2Cint%29
2. https://hadoop.apache.org/docs/r2.7.1/hadoop-project-dist/hadoop-hdfs/ShortCircuitLocalReads.html