Posted to user@hbase.apache.org by Zheng Da <zh...@gmail.com> on 2012/01/31 02:27:37 UTC

the size of a value and the block size.

Hello,

I'm thinking of using HBase to store a matrix: each subblock of the matrix
is stored as a value in HBase, and the key of the value is the location of
the subblock in the matrix. At first, I wanted the subblocks to be as
large as 8MB. But when I read
http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html, I
found that HBase splits keyvalue pairs into blocks, and the block size is
usually much smaller than 8MB. So what happens if I store 8MB of data as a
value in HBase? I tried it, and it seems to work fine. But what about the
performance?
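
Roughly, this is the write path I have in mind (just a sketch against the
0.92-era Java client API; the table name, column family, and key encoding
are placeholders I made up):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class MatrixBlockWriter {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "matrix");       // hypothetical table name

    int blockRow = 3, blockCol = 7;                  // location of the subblock
    byte[] subblock = new byte[8 * 1024 * 1024];     // ~8MB of matrix data

    // The row key encodes the subblock's position in the matrix.
    byte[] rowKey = Bytes.add(Bytes.toBytes(blockRow), Bytes.toBytes(blockCol));

    Put put = new Put(rowKey);
    put.add(Bytes.toBytes("d"), Bytes.toBytes("block"), subblock);  // family "d" is a placeholder
    table.put(put);
    table.close();
  }
}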

Thanks,
Da

Re: the size of a value and the block size.

Posted by Stack <st...@duboce.net>.
On Tue, Jan 31, 2012 at 7:30 PM, Zheng Da <zh...@gmail.com> wrote:
> It mentions "block size", and the figure shows data being split into
> blocks, where each block starts with a magic header indicating whether the
> data in the block is compressed or not. Also, blocks in HBase are indexed.
>

These 'blocks' are not hdfs 'blocks'.  The hbase hfile that we write
to hdfs is written in, by default, 64k chunks/blocks (this is the same
as the read-time blocks I talked about in my earlier message).  As
said already, these are not hdfs blocks (this blocking is done on top
of hdfs blocking).
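
If it matters for your use case, the hfile block size is a per-column-family
setting. Here is a rough sketch of bumping it at table-creation time with the
0.92-era Java admin API (the table and family names are made up):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class CreateMatrixTable {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);

    HTableDescriptor desc = new HTableDescriptor("matrix");   // hypothetical table name
    HColumnDescriptor family = new HColumnDescriptor("d");    // hypothetical column family
    family.setBlocksize(256 * 1024);   // hfile block size in bytes; the default is 64KB
    desc.addFamily(family);

    admin.createTable(desc);
  }
}

You can set the same BLOCKSIZE attribute on the family from the shell.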

> "Minimum block size. We recommend a setting of minimum block size between
> 8KB to 1MB for general usage. Larger block size is preferred if files are
> primarily for sequential access. However, it would lead to inefficient
> random access (because there are more data to decompress). Smaller blocks
> are good for random access, but require more memory to hold the block
> index, and may be slower to create (because we must flush the compressor
> stream at the conclusion of each data block, which leads to an FS I/O
> flush). Further, due to the internal caching in Compression codec, the
> smallest possible block size would be around 20KB-30KB."
> So each block with its prefixed "magic" header contains either plain or
> compressed data. What that looks like, we will have a look at in the next
> section.
>
> If data isn't split into blocks, how do these things work?
>

The above prescription rings about right (you should be referring to
the reference guide rather than to Lars' blog though; see
http://hbase.apache.org/book.html#hfilev2, which builds on Lars' blog to
explain how the hfile works in more recent HBase versions).  It pertains
to the hfile blocks.

I don't understand your question 'If data isn't split into blocks, how
do these things work?'

Data is split into hfile blocks.  Splits usually happen on hfile block
boundaries.

Please ask more questions so I can help you understand what's going on.

St.Ack

Re: the size of a value and the block size.

Posted by Zheng Da <zh...@gmail.com>.
Hello,

On Tue, Jan 31, 2012 at 3:45 PM, Stack <st...@duboce.net> wrote:

> On Mon, Jan 30, 2012 at 5:27 PM, Zheng Da <zh...@gmail.com> wrote:
> > Hello,
> >
> > I'm thinking of using HBase to store a matrix: each subblock of the
> > matrix is stored as a value in HBase, and the key of the value is the
> > location of the subblock in the matrix. At first, I wanted the
> > subblocks to be as large as 8MB. But when I read
> > http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html,
> > I found that HBase splits keyvalue pairs into blocks, and the block
> > size is usually much smaller than 8MB. So what happens if I store 8MB
> > of data as a value in HBase? I tried it, and it seems to work fine. But
> > what about the performance?
> >
>
> Please point to what in that blog has you thinking we split keyvalues.
>  We do not.
>
It mentions "block size", and the figure shows data being split into
blocks, where each block starts with a magic header indicating whether the
data in the block is compressed or not. Also, blocks in HBase are indexed.

"Minimum block size. We recommend a setting of minimum block size between
8KB to 1MB for general usage. Larger block size is preferred if files are
primarily for sequential access. However, it would lead to inefficient
random access (because there are more data to decompress). Smaller blocks
are good for random access, but require more memory to hold the block
index, and may be slower to create (because we must flush the compressor
stream at the conclusion of each data block, which leads to an FS I/O
flush). Further, due to the internal caching in Compression codec, the
smallest possible block size would be around 20KB-30KB."
So each block with its prefixed "magic" header contains either plain or
compressed data. What that looks like, we will have a look at in the next
section.

If data isn't split into blocks, how do these things work?

>
> Writing, we persist files that by default use hdfs blocks of 64MB.
> Reading, we will by default read in 64k chunks (hbase read blocks).
> The 64k chunks will contain whole keyvalues, which means we likely rarely
> read exactly 64KB.  If a keyvalue is 8MB, even though we're configured to
> read in 64KB blocks, we'll read in the coherent 8MB keyvalue as a block.
>
> Performance-wise, it's best you try it out.  Be aware that unless you
> configure things otherwise, this 8MB block coming up out of the
> filesystem will probably traverse the read-side block cache and blow
> out a bunch of lesser entries.  These are the kinds of things you'll
> need to consider.  Check out the performance section in the
> hbase reference guide: http://hbase.apache.org/book.html#performance


Thanks,
Da

Re: the size of a value and the block size.

Posted by Stack <st...@duboce.net>.
On Mon, Jan 30, 2012 at 5:27 PM, Zheng Da <zh...@gmail.com> wrote:
> Hello,
>
> I'm thinking of using HBase to store a matrix: each subblock of the matrix
> is stored as a value in HBase, and the key of the value is the location of
> the subblock in the matrix. At first, I wanted the subblocks to be as
> large as 8MB. But when I read
> http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html, I
> found that HBase splits keyvalue pairs into blocks, and the block size is
> usually much smaller than 8MB. So what happens if I store 8MB of data as a
> value in HBase? I tried it, and it seems to work fine. But what about the
> performance?
>

Please point to what in that blog has you thinking we split keyvalues.
 We do not.

Writing, we persist files that by default use hdfs blocks of 64MB.
Reading, we will by default read in 64k chunks (hbase read blocks).
The 64k chunks will contain whole keyvalues, which means we likely rarely
read exactly 64KB.  If a keyvalue is 8MB, even though we're configured to
read in 64KB blocks, we'll read in the coherent 8MB keyvalue as a block.

Performance-wise, it's best you try it out.  Be aware that unless you
configure things otherwise, this 8MB block coming up out of the
filesystem will probably traverse the read-side block cache and blow
out a bunch of lesser entries.  These are the kinds of things you'll
need to consider.  Check out the performance section in the
hbase reference guide: http://hbase.apache.org/book.html#performance
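
For instance, you can tell individual reads not to populate the block cache;
here is a rough sketch against the 0.92-era Java client API (the table name,
family, and key encoding are made-up placeholders; there is also
HColumnDescriptor#setBlockCacheEnabled(false) if you want caching off for the
whole family):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class MatrixBlockReader {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "matrix");  // hypothetical table name

    // Read one 8MB subblock without letting it evict hotter entries
    // from the read-side block cache.
    byte[] rowKey = Bytes.add(Bytes.toBytes(3), Bytes.toBytes(7));  // (blockRow, blockCol)
    Get get = new Get(rowKey);
    get.setCacheBlocks(false);
    Result result = table.get(get);
    byte[] subblock = result.getValue(Bytes.toBytes("d"), Bytes.toBytes("block"));
    System.out.println("read " + (subblock == null ? 0 : subblock.length) + " bytes");

    // Scans take the same hint.
    Scan scan = new Scan();
    scan.setCacheBlocks(false);

    table.close();
  }
}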

St.Ack