Posted to user@hbase.apache.org by William Kang <we...@gmail.com> on 2010/10/19 04:48:16 UTC

HBase random access in HDFS and block indices

Hi,
Recently I have spent some effort trying to understand the mechanisms
of HBase and to explore possible performance tuning options. Many
thanks to the folks in this community who helped with my questions; I
have sent out a report. But there are still a few questions left.

1. If an HFile block contains more than one key-value pair, does the
block index in the HFile record the offset of every key-value pair in
that block? Or does the block index only record the key range of each
block, so that you have to traverse inside the block until you find
the key you are looking for?

2. When HBase reads a block to fetch data or to traverse it, is the
whole block read into memory?

3. HBase blocks (64 KB, configurable) sit inside HDFS blocks (64 MB,
configurable), so to read an HBase block we have to do a random access
into an HDFS block. Even though HBase can use a positioned read such as
in.read(p, buf, 0, x) to read a small portion of the larger HDFS block,
it is still a random access. Would this be slow?

Many thanks. I would be grateful for your answers.


William

RE: HBase random access in HDFS and block indices

Posted by Jonathan Gray <jg...@facebook.com>.
HFiles are generally 256MB and the default block size is 64K, so that's 4,000 blocks (1/16th of what you said).  That gives a more reasonable block index of about 200K.

But the block index is kept in memory, so you only read it once, when the file is first opened.  So even if you do lower the block size and increase the block index, this would mostly slow down the initial open but should not have a major impact on random access performance once cached.

The offset is found by reading the headers of the HFile.  Again, this only has to be done on open.  Indexes used to be kept in separate files, but this doubled the number of open files HBase might need, an issue we are always pushing against.

I'm not sure seeking around a single HFile during the initial load of that file is an especially big issue (we are talking just 2-3 areas to read from, not hundreds).

JG
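
A back-of-the-envelope sketch of the arithmetic above (the 50 bytes per
index entry is the rough figure used later in this thread, not an exact
constant):

    public class BlockIndexEstimate {
      public static void main(String[] args) {
        long hfileSize = 256L * 1024 * 1024;  // typical HFile size (256 MB)
        long blockSize = 64L * 1024;          // default HBase block size (64 KB)
        long entryBytes = 50;                 // rough size of one index entry (key + offset)

        long numBlocks = hfileSize / blockSize;    // ~4,000 blocks
        long indexBytes = numBlocks * entryBytes;  // ~200 KB of index, held in RAM

        System.out.println("blocks=" + numBlocks + " indexBytes=" + indexBytes);
      }
    }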

> -----Original Message-----
> From: Matt Corgan [mailto:mcorgan@hotpads.com]
> Sent: Monday, October 18, 2010 8:53 PM
> To: user
> Subject: Re: HBase random access in HDFS and block indices
> 
> Do you guys ever worry about how big an HFile's index will be?  For
> example,
> if you have a 512mb HFile with 8k block size, you will have 64,000
> blocks.
>  If each index entry is 50b, then you have a 3.2mb index which is way
> out of
> line with your intention of having a small block size.  I believe
> that's
> read all at once so will be slow the first time... is the index cached
> somewhere (block cache?) so that index accesses are from memory?
> 
> And somewhat related - since the index is stored at the end of the
> HFile, is
> an additional random access required to find its offset?  If it was
> considered, why was that chosen over putting it in it's own file that
> could
> be accessed directly?
> 
> Thanks for all these explanations,
> Matt
> 
> 
> On Mon, Oct 18, 2010 at 11:27 PM, Ryan Rawson <ry...@gmail.com>
> wrote:
> 
> > The primary problem is the namenode memory. It contains entries for
> every
> > file and block, so setting hdfs block size small limits your
> scaleability.
> >
> > There is nothing inherently wrong with in file random read, Its just
> That
> > the hdfs client was written for a single reader to read most of a
> file.
> > This to achieve high performance you'd need to do tricks, such as
> > pipelining
> > sockets and socket pool reuse. Right now for random reads We open a
> new
> > socket, read data then close it.
> > On Oct 18, 2010 8:22 PM, "William Kang" <we...@gmail.com>
> wrote:
> > > Hi JG and Ryan,
> > > Thanks for the excellent answers.
> > >
> > > So, I am going to push everything to the extremes without
> considering
> > > the memory first. In theory, if in HBase, every cell size equals to
> > > HBase block size, then there would not be any in block traverse. In
> > > HDFS, very HBase block size equals to each HDFS block size, there
> > > would not be any in-file random access necessary. This would
> provide
> > > the best performance?
> > >
> > > But, the problem is that if the block in HBase is too large, the
> > > memory will run out since HBase load block into memory; if the
> block
> > > in HDFS is too small, the DN will run out of memory since each HDFS
> > > file takes some memory. So, it is a trade-off problem between
> memory
> > > and performance. Is it right?
> > >
> > > And would it make any difference between random reading the same
> size
> > > file portion from of a small HDFS block and from a large HDFS
> block?
> > >
> > > Thanks.
> > >
> > >
> > > William
> > >
> > > On Mon, Oct 18, 2010 at 10:58 PM, Ryan Rawson <ry...@gmail.com>
> > wrote:
> > >> On Mon, Oct 18, 2010 at 7:49 PM, William Kang
> <we...@gmail.com>
> > wrote:
> > >>> Hi,
> > >>> Recently I have spent some efforts to try to understand the
> mechanisms
> > >>> of HBase to exploit possible performance tunning options. And
> many
> > >>> thanks to the folks who helped with my questions in this
> community, I
> > >>> have sent a report. But, there are still few questions left.
> > >>>
> > >>> 1. If a HFile block contains more than one keyvalue pair, will
> the
> > >>> block index in HFile point out the offset for every keyvalue pair
> in
> > >>> that block? Or, the block index will just point out the key
> ranges
> > >>> inside that block, so you have to traverse inside the block until
> you
> > >>> meet the key you are looking for?
> > >>
> > >> The block index contains the first key for every block.  It
> therefore
> > >> defines in an [a,b) manner the range of each block. Once a block
> has
> > >> been selected to read from, it is read into memory then iterated
> over
> > >> until the key in question has been found (or the closest match has
> > >> been found).
> > >>
> > >>> 2. When HBase read block to fetching the data or traverse in it,
> is
> > >>> this block read into memory?
> > >>
> > >> yes, the entire block at a time is read in a single read
> operation.
> > >>
> > >>>
> > >>> 3. HBase blocks (64k configurable) are inside HDFS blocks (64m
> > >>> configurable), to read the HBase blocks, we have to random access
> the
> > >>> HDFS blocks. Even HBase can use in(p, buf, 0, x) to read a small
> > >>> portion of the larger HDFS blocks, it is still a random access.
> Would
> > >>> this be slow?
> > >>
> > >> Random access reads are not necessarily slow, they require several
> > things:
> > >> - disk seeks to the data in question
> > >> - disk seeks to the checksum files in question
> > >> - checksum computation and verification
> > >>
> > >> While not particularly slow, this could probably be optimized a
> bit.
> > >>
> > >> Most of the issues with random reads in HDFS is parallelizing the
> > >> reads and doing as much io-pushdown/scheduling as possible without
> > >> consuming an excess of sockets and threads.  The actual speed can
> be
> > >> excellent, or not, depending on how busy the IO subsystem is.
> > >>
> > >>
> > >>>
> > >>> Many thanks. I would be grateful for your answers.
> > >>>
> > >>>
> > >>> William
> > >>>
> > >>
> >

Re: HBase random access in HDFS and block indices

Posted by Stack <st...@duboce.net>.
On Mon, Oct 18, 2010 at 9:30 PM, Matt Corgan <mc...@hotpads.com> wrote:
> I was envisioning the HFiles being opened and closed more often, but it
> sounds like they're held open for long periods and that the indexes are
> permanently cached.  Is it roughly correct to say that after opening an
> HFile and loading its checksum/metadata/index/etc then each random data
> block access only requires a single pread, where the pread has some
> threading and connection overhead, but theoretically only requires one disk
> seek.  I'm curious because I'm trying to do a lot of random reads, and given
> enough application parallelism, the disk seeks should become the bottleneck
> much sooner than the network and threading overhead.
>

You have it basically right.

On region deploy, all files that comprise a region are opened and
thereafter held open.  Part of opening is reading in the index and file
metadata, so opened files occupy some memory.  An optimization would be
to let go of unused files and reopen them on access.

St.Ack
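
A minimal sketch of that open-once, hold-open pattern (placeholder
classes, not HBase's actual internals; the block-index loading is
stubbed out):

    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Placeholder for an opened store file: the stream plus its in-memory block index.
    class OpenStoreFile {
      final FSDataInputStream in;
      final long[] blockOffsets;  // loaded once at open time
      OpenStoreFile(FSDataInputStream in, long[] blockOffsets) {
        this.in = in;
        this.blockOffsets = blockOffsets;
      }
    }

    class StoreFileCache {
      private final Map<Path, OpenStoreFile> open = new HashMap<Path, OpenStoreFile>();
      private final FileSystem fs;
      StoreFileCache(FileSystem fs) { this.fs = fs; }

      // Open on first access (region deploy), then reuse the same reader for every pread.
      synchronized OpenStoreFile get(Path p) throws IOException {
        OpenStoreFile f = open.get(p);
        if (f == null) {
          FSDataInputStream in = fs.open(p);
          long[] index = loadBlockIndex(in);  // read trailer + index once; details elided
          f = new OpenStoreFile(in, index);
          open.put(p, f);
        }
        return f;
      }

      private long[] loadBlockIndex(FSDataInputStream in) {
        return new long[0];  // stub: real code would pread the trailer and index here
      }
    }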

RE: HBase random access in HDFS and block indices

Posted by Michael Segel <mi...@hotmail.com>.


> Date: Fri, 29 Oct 2010 10:01:24 -0700
> Subject: Re: HBase random access in HDFS and block indices
> From: stack@duboce.net
> To: user@hbase.apache.org
> 
> On Fri, Oct 29, 2010 at 6:41 AM, Sean Bigdatafun
> <se...@gmail.com> wrote:
> > I have the same doubt here. Let's say I have a totally random read pattern
> > (uniformly distributed).
> >
> > Now let's assume my total data size stored in HBase is 100TB on 10
> > machines(not a big deal considering nowaday's disks), and the total size of
> > my RS' memory is 10 * 6G = 60 GB. That translate into a 60/100*1000 = 0.06%
> > cache hit probablity. Under random read pattern, each read is bound to
> > experience the "open-> read index -> .... -> read datablock" sequence, which
> > would be expensive.
> >
> > Any comment?
> >
> 
> If totally random, as per Alvin's suggestion, yes, just turn off block
> caching since it is doing you no good.
> 
> But totally random is unusual in practise, no?
> 
> St.Ack

Uhm... not exactly.

One of the benefits of HBase is that it should scale in a *near* linear fashion.

So if we don't know how the data is to be accessed, or we know that there are a couple of access patterns that are orthogonal to each other, putting the data into the cloud in a 'random' fashion should provide consistent read access times.

So the design of 'randomly' stored data shouldn't be that unusual. It just means you're going to have a couple of different indexes. ;-)




Re: HBase random access in HDFS and block indices

Posted by Stack <st...@duboce.net>.
On Fri, Oct 29, 2010 at 6:41 AM, Sean Bigdatafun
<se...@gmail.com> wrote:
> I have the same doubt here. Let's say I have a totally random read pattern
> (uniformly distributed).
>
> Now let's assume my total data size stored in HBase is 100TB on 10
> machines(not a big deal considering nowaday's disks), and the total size of
> my RS' memory is 10 * 6G = 60 GB. That translate into a 60/100*1000 = 0.06%
> cache hit probablity. Under random read pattern, each read is bound to
> experience the "open-> read index -> .... -> read datablock" sequence, which
> would be expensive.
>
> Any comment?
>

If totally random, as per Alvin's suggestion, yes, just turn off block
caching since it is doing you no good.

But totally random is unusual in practice, no?

St.Ack

RE: HBase random access in HDFS and block indices

Posted by Michael Segel <mi...@hotmail.com>.
Once you instantiate the HFile object, you should be able to issue as many random get() calls against the table as you like, until you close the reference.
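
With the client API of that era, that amounts to opening the table once
and reusing it for every get; a rough sketch (the table name and row
keys are made up):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    public class ReuseTableSketch {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "mytable");  // open once
        try {
          for (int i = 0; i < 1000; i++) {
            Get get = new Get(Bytes.toBytes("row-" + i));  // many random gets, same reference
            Result r = table.get(get);
            // ... use r ...
          }
        } finally {
          table.close();
        }
      }
    }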


> Date: Tue, 2 Nov 2010 16:07:29 +0800
> Subject: Re: HBase random access in HDFS and block indices
> From: xietao.mailbox@gmail.com
> To: user@hbase.apache.org
> 
> I read the code and my understanding is when a RS starts StoreFiles of each
> Region
> will be instantiated. Then HFile.reader.loadFileInfo() will read the the
> index and file info.
> So each StoreFile is opened only once and block index are cached. The cache
> miss are
> for blocks. I mean for random Get each read does not need open HFile ->read
> index again.
> Is that right?
> 
> 2010/10/29 Sean Bigdatafun <se...@gmail.com>
> 
> > I have the same doubt here. Let's say I have a totally random read pattern
> > (uniformly distributed).
> >
> > Now let's assume my total data size stored in HBase is 100TB on 10
> > machines(not a big deal considering nowaday's disks), and the total size of
> > my RS' memory is 10 * 6G = 60 GB. That translate into a 60/100*1000 = 0.06%
> > cache hit probablity. Under random read pattern, each read is bound to
> > experience the "open-> read index -> .... -> read datablock" sequence,
> > which
> > would be expensive.
> >
> > Any comment?
> >
> >
> >
> > On Mon, Oct 18, 2010 at 9:30 PM, Matt Corgan <mc...@hotpads.com> wrote:
> >
> > > I was envisioning the HFiles being opened and closed more often, but it
> > > sounds like they're held open for long periods and that the indexes are
> > > permanently cached.  Is it roughly correct to say that after opening an
> > > HFile and loading its checksum/metadata/index/etc then each random data
> > > block access only requires a single pread, where the pread has some
> > > threading and connection overhead, but theoretically only requires one
> > disk
> > > seek.  I'm curious because I'm trying to do a lot of random reads, and
> > > given
> > > enough application parallelism, the disk seeks should become the
> > bottleneck
> > > much sooner than the network and threading overhead.
> > >
> > > Thanks again,
> > > Matt
> > >
> > > On Tue, Oct 19, 2010 at 12:07 AM, Ryan Rawson <ry...@gmail.com>
> > wrote:
> > >
> > > > Hi,
> > > >
> > > > Since the file is write-once, no random writes, putting the index at
> > > > the end is the only choice.  The loading goes like this:
> > > > - read fixed file trailer, ie: filelen.offset - <fixed size>
> > > > - read location of additional variable length sections, eg: block index
> > > > - read those indexes, including the variable length 'file-info' section
> > > >
> > > >
> > > > So to give some background, by default an InputStream from DFSClient
> > > > has a single socket and positioned reads are fairly fast.  The DFS
> > > > just sees 'read bytes from pos X length Y' commands on an open socket.
> > > >  That is fast.  But only 1 thread at a time can use this interface.
> > > > So for 'get' requests we use another interface called pread() which
> > > > takes a position+length, and returns data.  This involves setting up a
> > > > 1-use socket and tearing it down when we are done.  So it is slower by
> > > > 2-3 tcp RTT, thread instantiation and other misc overhead.
> > > >
> > > >
> > > > Back to the HFile index, it is indeed stored in ram, not block cache.
> > > > Size is generally not an issue, hasn't been yet.  We ship with a
> > > > default block size of 64k, and I'd recommend sticking with that unless
> > > > you have evidential proof your data set performance is better under a
> > > > different size.  Since the index grows linearly by a factor of 1/64k
> > > > with the bytes of the data, it isn't a huge deal.  Also the indexes
> > > > are spread around the cluster, so you are pushing load to all
> > > > machines.
> > > >
> > > >
> > > >
> > > >
> > > > On Mon, Oct 18, 2010 at 8:53 PM, Matt Corgan <mc...@hotpads.com>
> > > wrote:
> > > > > Do you guys ever worry about how big an HFile's index will be?  For
> > > > example,
> > > > > if you have a 512mb HFile with 8k block size, you will have 64,000
> > > > blocks.
> > > > >  If each index entry is 50b, then you have a 3.2mb index which is way
> > > out
> > > > of
> > > > > line with your intention of having a small block size.  I believe
> > > that's
> > > > > read all at once so will be slow the first time... is the index
> > cached
> > > > > somewhere (block cache?) so that index accesses are from memory?
> > > > >
> > > > > And somewhat related - since the index is stored at the end of the
> > > HFile,
> > > > is
> > > > > an additional random access required to find its offset?  If it was
> > > > > considered, why was that chosen over putting it in it's own file that
> > > > could
> > > > > be accessed directly?
> > > > >
> > > > > Thanks for all these explanations,
> > > > > Matt
> > > > >
> > > > >
> > > > > On Mon, Oct 18, 2010 at 11:27 PM, Ryan Rawson <ry...@gmail.com>
> > > > wrote:
> > > > >
> > > > >> The primary problem is the namenode memory. It contains entries for
> > > > every
> > > > >> file and block, so setting hdfs block size small limits your
> > > > scaleability.
> > > > >>
> > > > >> There is nothing inherently wrong with in file random read, Its just
> > > > That
> > > > >> the hdfs client was written for a single reader to read most of a
> > > file.
> > > > >> This to achieve high performance you'd need to do tricks, such as
> > > > >> pipelining
> > > > >> sockets and socket pool reuse. Right now for random reads We open a
> > > new
> > > > >> socket, read data then close it.
> > > > >> On Oct 18, 2010 8:22 PM, "William Kang" <we...@gmail.com>
> > > wrote:
> > > > >> > Hi JG and Ryan,
> > > > >> > Thanks for the excellent answers.
> > > > >> >
> > > > >> > So, I am going to push everything to the extremes without
> > > considering
> > > > >> > the memory first. In theory, if in HBase, every cell size equals
> > to
> > > > >> > HBase block size, then there would not be any in block traverse.
> > In
> > > > >> > HDFS, very HBase block size equals to each HDFS block size, there
> > > > >> > would not be any in-file random access necessary. This would
> > provide
> > > > >> > the best performance?
> > > > >> >
> > > > >> > But, the problem is that if the block in HBase is too large, the
> > > > >> > memory will run out since HBase load block into memory; if the
> > block
> > > > >> > in HDFS is too small, the DN will run out of memory since each
> > HDFS
> > > > >> > file takes some memory. So, it is a trade-off problem between
> > memory
> > > > >> > and performance. Is it right?
> > > > >> >
> > > > >> > And would it make any difference between random reading the same
> > > size
> > > > >> > file portion from of a small HDFS block and from a large HDFS
> > block?
> > > > >> >
> > > > >> > Thanks.
> > > > >> >
> > > > >> >
> > > > >> > William
> > > > >> >
> > > > >> > On Mon, Oct 18, 2010 at 10:58 PM, Ryan Rawson <ryanobjc@gmail.com
> > >
> > > > >> wrote:
> > > > >> >> On Mon, Oct 18, 2010 at 7:49 PM, William Kang <
> > > > weliam.cloud@gmail.com>
> > > > >> wrote:
> > > > >> >>> Hi,
> > > > >> >>> Recently I have spent some efforts to try to understand the
> > > > mechanisms
> > > > >> >>> of HBase to exploit possible performance tunning options. And
> > many
> > > > >> >>> thanks to the folks who helped with my questions in this
> > > community,
> > > > I
> > > > >> >>> have sent a report. But, there are still few questions left.
> > > > >> >>>
> > > > >> >>> 1. If a HFile block contains more than one keyvalue pair, will
> > the
> > > > >> >>> block index in HFile point out the offset for every keyvalue
> > pair
> > > in
> > > > >> >>> that block? Or, the block index will just point out the key
> > ranges
> > > > >> >>> inside that block, so you have to traverse inside the block
> > until
> > > > you
> > > > >> >>> meet the key you are looking for?
> > > > >> >>
> > > > >> >> The block index contains the first key for every block.  It
> > > therefore
> > > > >> >> defines in an [a,b) manner the range of each block. Once a block
> > > has
> > > > >> >> been selected to read from, it is read into memory then iterated
> > > over
> > > > >> >> until the key in question has been found (or the closest match
> > has
> > > > >> >> been found).
> > > > >> >>
> > > > >> >>> 2. When HBase read block to fetching the data or traverse in it,
> > > is
> > > > >> >>> this block read into memory?
> > > > >> >>
> > > > >> >> yes, the entire block at a time is read in a single read
> > operation.
> > > > >> >>
> > > > >> >>>
> > > > >> >>> 3. HBase blocks (64k configurable) are inside HDFS blocks (64m
> > > > >> >>> configurable), to read the HBase blocks, we have to random
> > access
> > > > the
> > > > >> >>> HDFS blocks. Even HBase can use in(p, buf, 0, x) to read a small
> > > > >> >>> portion of the larger HDFS blocks, it is still a random access.
> > > > Would
> > > > >> >>> this be slow?
> > > > >> >>
> > > > >> >> Random access reads are not necessarily slow, they require
> > several
> > > > >> things:
> > > > >> >> - disk seeks to the data in question
> > > > >> >> - disk seeks to the checksum files in question
> > > > >> >> - checksum computation and verification
> > > > >> >>
> > > > >> >> While not particularly slow, this could probably be optimized a
> > > bit.
> > > > >> >>
> > > > >> >> Most of the issues with random reads in HDFS is parallelizing the
> > > > >> >> reads and doing as much io-pushdown/scheduling as possible
> > without
> > > > >> >> consuming an excess of sockets and threads.  The actual speed can
> > > be
> > > > >> >> excellent, or not, depending on how busy the IO subsystem is.
> > > > >> >>
> > > > >> >>
> > > > >> >>>
> > > > >> >>> Many thanks. I would be grateful for your answers.
> > > > >> >>>
> > > > >> >>>
> > > > >> >>> William
> > > > >> >>>
> > > > >> >>
> > > > >>
> > > > >
> > > >
> > >
> >

Re: HBase random access in HDFS and block indices

Posted by Tao Xie <xi...@gmail.com>.
I read the code, and my understanding is that when a RS starts, the StoreFiles
of each Region are instantiated. Then HFile.Reader.loadFileInfo() reads the
index and the file info.
So each StoreFile is opened only once, and the block index is cached. The cache
misses are only for data blocks. In other words, for a random Get, each read does
not need to open the HFile and read the index again.
Is that right?
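
So the per-Get path, once the file is open, is roughly: look up the
block in the in-memory index, then issue one pread for that data block.
A small sketch of the index lookup (placeholder arrays, not the real
reader classes):

    import java.util.Arrays;
    import org.apache.hadoop.hbase.util.Bytes;

    public class BlockIndexLookup {
      // Block index as loaded once at open time: first key and file offset of every block.
      private final byte[][] firstKeys;   // sorted ascending
      private final long[] blockOffsets;  // parallel to firstKeys

      BlockIndexLookup(byte[][] firstKeys, long[] blockOffsets) {
        this.firstKeys = firstKeys;
        this.blockOffsets = blockOffsets;
      }

      // Return the offset of the block whose [firstKey_i, firstKey_i+1) range could
      // hold 'key', or -1 if the key sorts before the first block.
      long offsetFor(byte[] key) {
        int i = Arrays.binarySearch(firstKeys, key, Bytes.BYTES_COMPARATOR);
        if (i >= 0) return blockOffsets[i];  // exact match on a block's first key
        int insert = -i - 1;                 // first block whose firstKey > key
        return insert == 0 ? -1 : blockOffsets[insert - 1];
      }
    }

The returned offset would then feed a single positioned read, and the
block is scanned linearly in memory for the exact key.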

2010/10/29 Sean Bigdatafun <se...@gmail.com>

> I have the same doubt here. Let's say I have a totally random read pattern
> (uniformly distributed).
>
> Now let's assume my total data size stored in HBase is 100TB on 10
> machines(not a big deal considering nowaday's disks), and the total size of
> my RS' memory is 10 * 6G = 60 GB. That translate into a 60/100*1000 = 0.06%
> cache hit probablity. Under random read pattern, each read is bound to
> experience the "open-> read index -> .... -> read datablock" sequence,
> which
> would be expensive.
>
> Any comment?
>
>
>
> On Mon, Oct 18, 2010 at 9:30 PM, Matt Corgan <mc...@hotpads.com> wrote:
>
> > I was envisioning the HFiles being opened and closed more often, but it
> > sounds like they're held open for long periods and that the indexes are
> > permanently cached.  Is it roughly correct to say that after opening an
> > HFile and loading its checksum/metadata/index/etc then each random data
> > block access only requires a single pread, where the pread has some
> > threading and connection overhead, but theoretically only requires one
> disk
> > seek.  I'm curious because I'm trying to do a lot of random reads, and
> > given
> > enough application parallelism, the disk seeks should become the
> bottleneck
> > much sooner than the network and threading overhead.
> >
> > Thanks again,
> > Matt
> >
> > On Tue, Oct 19, 2010 at 12:07 AM, Ryan Rawson <ry...@gmail.com>
> wrote:
> >
> > > Hi,
> > >
> > > Since the file is write-once, no random writes, putting the index at
> > > the end is the only choice.  The loading goes like this:
> > > - read fixed file trailer, ie: filelen.offset - <fixed size>
> > > - read location of additional variable length sections, eg: block index
> > > - read those indexes, including the variable length 'file-info' section
> > >
> > >
> > > So to give some background, by default an InputStream from DFSClient
> > > has a single socket and positioned reads are fairly fast.  The DFS
> > > just sees 'read bytes from pos X length Y' commands on an open socket.
> > >  That is fast.  But only 1 thread at a time can use this interface.
> > > So for 'get' requests we use another interface called pread() which
> > > takes a position+length, and returns data.  This involves setting up a
> > > 1-use socket and tearing it down when we are done.  So it is slower by
> > > 2-3 tcp RTT, thread instantiation and other misc overhead.
> > >
> > >
> > > Back to the HFile index, it is indeed stored in ram, not block cache.
> > > Size is generally not an issue, hasn't been yet.  We ship with a
> > > default block size of 64k, and I'd recommend sticking with that unless
> > > you have evidential proof your data set performance is better under a
> > > different size.  Since the index grows linearly by a factor of 1/64k
> > > with the bytes of the data, it isn't a huge deal.  Also the indexes
> > > are spread around the cluster, so you are pushing load to all
> > > machines.
> > >
> > >
> > >
> > >
> > > On Mon, Oct 18, 2010 at 8:53 PM, Matt Corgan <mc...@hotpads.com>
> > wrote:
> > > > Do you guys ever worry about how big an HFile's index will be?  For
> > > example,
> > > > if you have a 512mb HFile with 8k block size, you will have 64,000
> > > blocks.
> > > >  If each index entry is 50b, then you have a 3.2mb index which is way
> > out
> > > of
> > > > line with your intention of having a small block size.  I believe
> > that's
> > > > read all at once so will be slow the first time... is the index
> cached
> > > > somewhere (block cache?) so that index accesses are from memory?
> > > >
> > > > And somewhat related - since the index is stored at the end of the
> > HFile,
> > > is
> > > > an additional random access required to find its offset?  If it was
> > > > considered, why was that chosen over putting it in it's own file that
> > > could
> > > > be accessed directly?
> > > >
> > > > Thanks for all these explanations,
> > > > Matt
> > > >
> > > >
> > > > On Mon, Oct 18, 2010 at 11:27 PM, Ryan Rawson <ry...@gmail.com>
> > > wrote:
> > > >
> > > >> The primary problem is the namenode memory. It contains entries for
> > > every
> > > >> file and block, so setting hdfs block size small limits your
> > > scaleability.
> > > >>
> > > >> There is nothing inherently wrong with in file random read, Its just
> > > That
> > > >> the hdfs client was written for a single reader to read most of a
> > file.
> > > >> This to achieve high performance you'd need to do tricks, such as
> > > >> pipelining
> > > >> sockets and socket pool reuse. Right now for random reads We open a
> > new
> > > >> socket, read data then close it.
> > > >> On Oct 18, 2010 8:22 PM, "William Kang" <we...@gmail.com>
> > wrote:
> > > >> > Hi JG and Ryan,
> > > >> > Thanks for the excellent answers.
> > > >> >
> > > >> > So, I am going to push everything to the extremes without
> > considering
> > > >> > the memory first. In theory, if in HBase, every cell size equals
> to
> > > >> > HBase block size, then there would not be any in block traverse.
> In
> > > >> > HDFS, very HBase block size equals to each HDFS block size, there
> > > >> > would not be any in-file random access necessary. This would
> provide
> > > >> > the best performance?
> > > >> >
> > > >> > But, the problem is that if the block in HBase is too large, the
> > > >> > memory will run out since HBase load block into memory; if the
> block
> > > >> > in HDFS is too small, the DN will run out of memory since each
> HDFS
> > > >> > file takes some memory. So, it is a trade-off problem between
> memory
> > > >> > and performance. Is it right?
> > > >> >
> > > >> > And would it make any difference between random reading the same
> > size
> > > >> > file portion from of a small HDFS block and from a large HDFS
> block?
> > > >> >
> > > >> > Thanks.
> > > >> >
> > > >> >
> > > >> > William
> > > >> >
> > > >> > On Mon, Oct 18, 2010 at 10:58 PM, Ryan Rawson <ryanobjc@gmail.com
> >
> > > >> wrote:
> > > >> >> On Mon, Oct 18, 2010 at 7:49 PM, William Kang <
> > > weliam.cloud@gmail.com>
> > > >> wrote:
> > > >> >>> Hi,
> > > >> >>> Recently I have spent some efforts to try to understand the
> > > mechanisms
> > > >> >>> of HBase to exploit possible performance tunning options. And
> many
> > > >> >>> thanks to the folks who helped with my questions in this
> > community,
> > > I
> > > >> >>> have sent a report. But, there are still few questions left.
> > > >> >>>
> > > >> >>> 1. If a HFile block contains more than one keyvalue pair, will
> the
> > > >> >>> block index in HFile point out the offset for every keyvalue
> pair
> > in
> > > >> >>> that block? Or, the block index will just point out the key
> ranges
> > > >> >>> inside that block, so you have to traverse inside the block
> until
> > > you
> > > >> >>> meet the key you are looking for?
> > > >> >>
> > > >> >> The block index contains the first key for every block.  It
> > therefore
> > > >> >> defines in an [a,b) manner the range of each block. Once a block
> > has
> > > >> >> been selected to read from, it is read into memory then iterated
> > over
> > > >> >> until the key in question has been found (or the closest match
> has
> > > >> >> been found).
> > > >> >>
> > > >> >>> 2. When HBase read block to fetching the data or traverse in it,
> > is
> > > >> >>> this block read into memory?
> > > >> >>
> > > >> >> yes, the entire block at a time is read in a single read
> operation.
> > > >> >>
> > > >> >>>
> > > >> >>> 3. HBase blocks (64k configurable) are inside HDFS blocks (64m
> > > >> >>> configurable), to read the HBase blocks, we have to random
> access
> > > the
> > > >> >>> HDFS blocks. Even HBase can use in(p, buf, 0, x) to read a small
> > > >> >>> portion of the larger HDFS blocks, it is still a random access.
> > > Would
> > > >> >>> this be slow?
> > > >> >>
> > > >> >> Random access reads are not necessarily slow, they require
> several
> > > >> things:
> > > >> >> - disk seeks to the data in question
> > > >> >> - disk seeks to the checksum files in question
> > > >> >> - checksum computation and verification
> > > >> >>
> > > >> >> While not particularly slow, this could probably be optimized a
> > bit.
> > > >> >>
> > > >> >> Most of the issues with random reads in HDFS is parallelizing the
> > > >> >> reads and doing as much io-pushdown/scheduling as possible
> without
> > > >> >> consuming an excess of sockets and threads.  The actual speed can
> > be
> > > >> >> excellent, or not, depending on how busy the IO subsystem is.
> > > >> >>
> > > >> >>
> > > >> >>>
> > > >> >>> Many thanks. I would be grateful for your answers.
> > > >> >>>
> > > >> >>>
> > > >> >>> William
> > > >> >>>
> > > >> >>
> > > >>
> > > >
> > >
> >
>

Re: HBase random access in HDFS and block indices

Posted by "Alvin C.L Huang" <CL...@itri.org.tw>.
@Sean

Given an expectedly low hit ratio, the cache will not benefit your random gets.
It would also cause too many Java major GCs, which pause all threads and thus
hurt performance.

Try no cache at all.

--
Alvin C.-L., Huang
ATC, ICL, ITRI, Taiwan
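
For example, block caching can be turned off per column family when the
table is created; a sketch against the admin API (the table and family
names are made up):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.client.HBaseAdmin;

    public class NoBlockCacheSketch {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);

        HTableDescriptor desc = new HTableDescriptor("randomly_read_table");
        HColumnDescriptor family = new HColumnDescriptor("d");
        family.setBlockCacheEnabled(false);  // uniformly random reads: cache hits are negligible
        desc.addFamily(family);

        admin.createTable(desc);
      }
    }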


On 29 October 2010 21:41, Sean Bigdatafun <se...@gmail.com> wrote:

> I have the same doubt here. Let's say I have a totally random read pattern
> (uniformly distributed).
>
> Now let's assume my total data size stored in HBase is 100TB on 10
> machines(not a big deal considering nowaday's disks), and the total size of
> my RS' memory is 10 * 6G = 60 GB. That translate into a 60/100*1000 = 0.06%
> cache hit probablity. Under random read pattern, each read is bound to
> experience the "open-> read index -> .... -> read datablock" sequence,
> which
> would be expensive.
>
> Any comment?
>
>
>
> On Mon, Oct 18, 2010 at 9:30 PM, Matt Corgan <mc...@hotpads.com> wrote:
>
> > I was envisioning the HFiles being opened and closed more often, but it
> > sounds like they're held open for long periods and that the indexes are
> > permanently cached.  Is it roughly correct to say that after opening an
> > HFile and loading its checksum/metadata/index/etc then each random data
> > block access only requires a single pread, where the pread has some
> > threading and connection overhead, but theoretically only requires one
> disk
> > seek.  I'm curious because I'm trying to do a lot of random reads, and
> > given
> > enough application parallelism, the disk seeks should become the
> bottleneck
> > much sooner than the network and threading overhead.
> >
> > Thanks again,
> > Matt
> >
> > On Tue, Oct 19, 2010 at 12:07 AM, Ryan Rawson <ry...@gmail.com>
> wrote:
> >
> > > Hi,
> > >
> > > Since the file is write-once, no random writes, putting the index at
> > > the end is the only choice.  The loading goes like this:
> > > - read fixed file trailer, ie: filelen.offset - <fixed size>
> > > - read location of additional variable length sections, eg: block index
> > > - read those indexes, including the variable length 'file-info' section
> > >
> > >
> > > So to give some background, by default an InputStream from DFSClient
> > > has a single socket and positioned reads are fairly fast.  The DFS
> > > just sees 'read bytes from pos X length Y' commands on an open socket.
> > >  That is fast.  But only 1 thread at a time can use this interface.
> > > So for 'get' requests we use another interface called pread() which
> > > takes a position+length, and returns data.  This involves setting up a
> > > 1-use socket and tearing it down when we are done.  So it is slower by
> > > 2-3 tcp RTT, thread instantiation and other misc overhead.
> > >
> > >
> > > Back to the HFile index, it is indeed stored in ram, not block cache.
> > > Size is generally not an issue, hasn't been yet.  We ship with a
> > > default block size of 64k, and I'd recommend sticking with that unless
> > > you have evidential proof your data set performance is better under a
> > > different size.  Since the index grows linearly by a factor of 1/64k
> > > with the bytes of the data, it isn't a huge deal.  Also the indexes
> > > are spread around the cluster, so you are pushing load to all
> > > machines.
> > >
> > >
> > >
> > >
> > > On Mon, Oct 18, 2010 at 8:53 PM, Matt Corgan <mc...@hotpads.com>
> > wrote:
> > > > Do you guys ever worry about how big an HFile's index will be?  For
> > > example,
> > > > if you have a 512mb HFile with 8k block size, you will have 64,000
> > > blocks.
> > > >  If each index entry is 50b, then you have a 3.2mb index which is way
> > out
> > > of
> > > > line with your intention of having a small block size.  I believe
> > that's
> > > > read all at once so will be slow the first time... is the index
> cached
> > > > somewhere (block cache?) so that index accesses are from memory?
> > > >
> > > > And somewhat related - since the index is stored at the end of the
> > HFile,
> > > is
> > > > an additional random access required to find its offset?  If it was
> > > > considered, why was that chosen over putting it in it's own file that
> > > could
> > > > be accessed directly?
> > > >
> > > > Thanks for all these explanations,
> > > > Matt
> > > >
> > > >
> > > > On Mon, Oct 18, 2010 at 11:27 PM, Ryan Rawson <ry...@gmail.com>
> > > wrote:
> > > >
> > > >> The primary problem is the namenode memory. It contains entries for
> > > every
> > > >> file and block, so setting hdfs block size small limits your
> > > scaleability.
> > > >>
> > > >> There is nothing inherently wrong with in file random read, Its just
> > > That
> > > >> the hdfs client was written for a single reader to read most of a
> > file.
> > > >> This to achieve high performance you'd need to do tricks, such as
> > > >> pipelining
> > > >> sockets and socket pool reuse. Right now for random reads We open a
> > new
> > > >> socket, read data then close it.
> > > >> On Oct 18, 2010 8:22 PM, "William Kang" <we...@gmail.com>
> > wrote:
> > > >> > Hi JG and Ryan,
> > > >> > Thanks for the excellent answers.
> > > >> >
> > > >> > So, I am going to push everything to the extremes without
> > considering
> > > >> > the memory first. In theory, if in HBase, every cell size equals
> to
> > > >> > HBase block size, then there would not be any in block traverse.
> In
> > > >> > HDFS, very HBase block size equals to each HDFS block size, there
> > > >> > would not be any in-file random access necessary. This would
> provide
> > > >> > the best performance?
> > > >> >
> > > >> > But, the problem is that if the block in HBase is too large, the
> > > >> > memory will run out since HBase load block into memory; if the
> block
> > > >> > in HDFS is too small, the DN will run out of memory since each
> HDFS
> > > >> > file takes some memory. So, it is a trade-off problem between
> memory
> > > >> > and performance. Is it right?
> > > >> >
> > > >> > And would it make any difference between random reading the same
> > size
> > > >> > file portion from of a small HDFS block and from a large HDFS
> block?
> > > >> >
> > > >> > Thanks.
> > > >> >
> > > >> >
> > > >> > William
> > > >> >
> > > >> > On Mon, Oct 18, 2010 at 10:58 PM, Ryan Rawson <ryanobjc@gmail.com
> >
> > > >> wrote:
> > > >> >> On Mon, Oct 18, 2010 at 7:49 PM, William Kang <
> > > weliam.cloud@gmail.com>
> > > >> wrote:
> > > >> >>> Hi,
> > > >> >>> Recently I have spent some efforts to try to understand the
> > > mechanisms
> > > >> >>> of HBase to exploit possible performance tunning options. And
> many
> > > >> >>> thanks to the folks who helped with my questions in this
> > community,
> > > I
> > > >> >>> have sent a report. But, there are still few questions left.
> > > >> >>>
> > > >> >>> 1. If a HFile block contains more than one keyvalue pair, will
> the
> > > >> >>> block index in HFile point out the offset for every keyvalue
> pair
> > in
> > > >> >>> that block? Or, the block index will just point out the key
> ranges
> > > >> >>> inside that block, so you have to traverse inside the block
> until
> > > you
> > > >> >>> meet the key you are looking for?
> > > >> >>
> > > >> >> The block index contains the first key for every block.  It
> > therefore
> > > >> >> defines in an [a,b) manner the range of each block. Once a block
> > has
> > > >> >> been selected to read from, it is read into memory then iterated
> > over
> > > >> >> until the key in question has been found (or the closest match
> has
> > > >> >> been found).
> > > >> >>
> > > >> >>> 2. When HBase read block to fetching the data or traverse in it,
> > is
> > > >> >>> this block read into memory?
> > > >> >>
> > > >> >> yes, the entire block at a time is read in a single read
> operation.
> > > >> >>
> > > >> >>>
> > > >> >>> 3. HBase blocks (64k configurable) are inside HDFS blocks (64m
> > > >> >>> configurable), to read the HBase blocks, we have to random
> access
> > > the
> > > >> >>> HDFS blocks. Even HBase can use in(p, buf, 0, x) to read a small
> > > >> >>> portion of the larger HDFS blocks, it is still a random access.
> > > Would
> > > >> >>> this be slow?
> > > >> >>
> > > >> >> Random access reads are not necessarily slow, they require
> several
> > > >> things:
> > > >> >> - disk seeks to the data in question
> > > >> >> - disk seeks to the checksum files in question
> > > >> >> - checksum computation and verification
> > > >> >>
> > > >> >> While not particularly slow, this could probably be optimized a
> > bit.
> > > >> >>
> > > >> >> Most of the issues with random reads in HDFS is parallelizing the
> > > >> >> reads and doing as much io-pushdown/scheduling as possible
> without
> > > >> >> consuming an excess of sockets and threads.  The actual speed can
> > be
> > > >> >> excellent, or not, depending on how busy the IO subsystem is.
> > > >> >>
> > > >> >>
> > > >> >>>
> > > >> >>> Many thanks. I would be grateful for your answers.
> > > >> >>>
> > > >> >>>
> > > >> >>> William
> > > >> >>>
> > > >> >>
> > > >>
> > > >
> > >
> >
>

Re: HBase random access in HDFS and block indices

Posted by Sean Bigdatafun <se...@gmail.com>.
I have the same doubt here. Let's say I have a totally random read pattern
(uniformly distributed).

Now let's assume my total data size stored in HBase is 100 TB on 10
machines (not a big deal considering today's disks), and the total size of
my RS' memory is 10 * 6 GB = 60 GB. That translates into roughly 60 GB / 100 TB
= 0.06% cache hit probability. Under a random read pattern, each read is bound to
experience the "open -> read index -> ... -> read datablock" sequence, which
would be expensive.

Any comment?



On Mon, Oct 18, 2010 at 9:30 PM, Matt Corgan <mc...@hotpads.com> wrote:

> I was envisioning the HFiles being opened and closed more often, but it
> sounds like they're held open for long periods and that the indexes are
> permanently cached.  Is it roughly correct to say that after opening an
> HFile and loading its checksum/metadata/index/etc then each random data
> block access only requires a single pread, where the pread has some
> threading and connection overhead, but theoretically only requires one disk
> seek.  I'm curious because I'm trying to do a lot of random reads, and
> given
> enough application parallelism, the disk seeks should become the bottleneck
> much sooner than the network and threading overhead.
>
> Thanks again,
> Matt
>
> On Tue, Oct 19, 2010 at 12:07 AM, Ryan Rawson <ry...@gmail.com> wrote:
>
> > Hi,
> >
> > Since the file is write-once, no random writes, putting the index at
> > the end is the only choice.  The loading goes like this:
> > - read fixed file trailer, ie: filelen.offset - <fixed size>
> > - read location of additional variable length sections, eg: block index
> > - read those indexes, including the variable length 'file-info' section
> >
> >
> > So to give some background, by default an InputStream from DFSClient
> > has a single socket and positioned reads are fairly fast.  The DFS
> > just sees 'read bytes from pos X length Y' commands on an open socket.
> >  That is fast.  But only 1 thread at a time can use this interface.
> > So for 'get' requests we use another interface called pread() which
> > takes a position+length, and returns data.  This involves setting up a
> > 1-use socket and tearing it down when we are done.  So it is slower by
> > 2-3 tcp RTT, thread instantiation and other misc overhead.
> >
> >
> > Back to the HFile index, it is indeed stored in ram, not block cache.
> > Size is generally not an issue, hasn't been yet.  We ship with a
> > default block size of 64k, and I'd recommend sticking with that unless
> > you have evidential proof your data set performance is better under a
> > different size.  Since the index grows linearly by a factor of 1/64k
> > with the bytes of the data, it isn't a huge deal.  Also the indexes
> > are spread around the cluster, so you are pushing load to all
> > machines.
> >
> >
> >
> >
> > On Mon, Oct 18, 2010 at 8:53 PM, Matt Corgan <mc...@hotpads.com>
> wrote:
> > > Do you guys ever worry about how big an HFile's index will be?  For
> > example,
> > > if you have a 512mb HFile with 8k block size, you will have 64,000
> > blocks.
> > >  If each index entry is 50b, then you have a 3.2mb index which is way
> out
> > of
> > > line with your intention of having a small block size.  I believe
> that's
> > > read all at once so will be slow the first time... is the index cached
> > > somewhere (block cache?) so that index accesses are from memory?
> > >
> > > And somewhat related - since the index is stored at the end of the
> HFile,
> > is
> > > an additional random access required to find its offset?  If it was
> > > considered, why was that chosen over putting it in it's own file that
> > could
> > > be accessed directly?
> > >
> > > Thanks for all these explanations,
> > > Matt
> > >
> > >
> > > On Mon, Oct 18, 2010 at 11:27 PM, Ryan Rawson <ry...@gmail.com>
> > wrote:
> > >
> > >> The primary problem is the namenode memory. It contains entries for
> > every
> > >> file and block, so setting hdfs block size small limits your
> > scaleability.
> > >>
> > >> There is nothing inherently wrong with in file random read, Its just
> > That
> > >> the hdfs client was written for a single reader to read most of a
> file.
> > >> This to achieve high performance you'd need to do tricks, such as
> > >> pipelining
> > >> sockets and socket pool reuse. Right now for random reads We open a
> new
> > >> socket, read data then close it.
> > >> On Oct 18, 2010 8:22 PM, "William Kang" <we...@gmail.com>
> wrote:
> > >> > Hi JG and Ryan,
> > >> > Thanks for the excellent answers.
> > >> >
> > >> > So, I am going to push everything to the extremes without
> considering
> > >> > the memory first. In theory, if in HBase, every cell size equals to
> > >> > HBase block size, then there would not be any in block traverse. In
> > >> > HDFS, very HBase block size equals to each HDFS block size, there
> > >> > would not be any in-file random access necessary. This would provide
> > >> > the best performance?
> > >> >
> > >> > But, the problem is that if the block in HBase is too large, the
> > >> > memory will run out since HBase load block into memory; if the block
> > >> > in HDFS is too small, the DN will run out of memory since each HDFS
> > >> > file takes some memory. So, it is a trade-off problem between memory
> > >> > and performance. Is it right?
> > >> >
> > >> > And would it make any difference between random reading the same
> size
> > >> > file portion from of a small HDFS block and from a large HDFS block?
> > >> >
> > >> > Thanks.
> > >> >
> > >> >
> > >> > William
> > >> >
> > >> > On Mon, Oct 18, 2010 at 10:58 PM, Ryan Rawson <ry...@gmail.com>
> > >> wrote:
> > >> >> On Mon, Oct 18, 2010 at 7:49 PM, William Kang <
> > weliam.cloud@gmail.com>
> > >> wrote:
> > >> >>> Hi,
> > >> >>> Recently I have spent some efforts to try to understand the
> > mechanisms
> > >> >>> of HBase to exploit possible performance tunning options. And many
> > >> >>> thanks to the folks who helped with my questions in this
> community,
> > I
> > >> >>> have sent a report. But, there are still few questions left.
> > >> >>>
> > >> >>> 1. If a HFile block contains more than one keyvalue pair, will the
> > >> >>> block index in HFile point out the offset for every keyvalue pair
> in
> > >> >>> that block? Or, the block index will just point out the key ranges
> > >> >>> inside that block, so you have to traverse inside the block until
> > you
> > >> >>> meet the key you are looking for?
> > >> >>
> > >> >> The block index contains the first key for every block.  It
> therefore
> > >> >> defines in an [a,b) manner the range of each block. Once a block
> has
> > >> >> been selected to read from, it is read into memory then iterated
> over
> > >> >> until the key in question has been found (or the closest match has
> > >> >> been found).
> > >> >>
> > >> >>> 2. When HBase read block to fetching the data or traverse in it,
> is
> > >> >>> this block read into memory?
> > >> >>
> > >> >> yes, the entire block at a time is read in a single read operation.
> > >> >>
> > >> >>>
> > >> >>> 3. HBase blocks (64k configurable) are inside HDFS blocks (64m
> > >> >>> configurable), to read the HBase blocks, we have to random access
> > the
> > >> >>> HDFS blocks. Even HBase can use in(p, buf, 0, x) to read a small
> > >> >>> portion of the larger HDFS blocks, it is still a random access.
> > Would
> > >> >>> this be slow?
> > >> >>
> > >> >> Random access reads are not necessarily slow, they require several
> > >> things:
> > >> >> - disk seeks to the data in question
> > >> >> - disk seeks to the checksum files in question
> > >> >> - checksum computation and verification
> > >> >>
> > >> >> While not particularly slow, this could probably be optimized a
> bit.
> > >> >>
> > >> >> Most of the issues with random reads in HDFS is parallelizing the
> > >> >> reads and doing as much io-pushdown/scheduling as possible without
> > >> >> consuming an excess of sockets and threads.  The actual speed can
> be
> > >> >> excellent, or not, depending on how busy the IO subsystem is.
> > >> >>
> > >> >>
> > >> >>>
> > >> >>> Many thanks. I would be grateful for your answers.
> > >> >>>
> > >> >>>
> > >> >>> William
> > >> >>>
> > >> >>
> > >>
> > >
> >
>

Re: HBase random access in HDFS and block indices

Posted by Matt Corgan <mc...@hotpads.com>.
I was envisioning the HFiles being opened and closed more often, but it
sounds like they're held open for long periods and that the indexes are
permanently cached.  Is it roughly correct to say that after opening an
HFile and loading its checksum/metadata/index/etc, each random data
block access only requires a single pread, where the pread has some
threading and connection overhead but theoretically only requires one disk
seek?  I'm curious because I'm trying to do a lot of random reads, and given
enough application parallelism, the disk seeks should become the bottleneck
much sooner than the network and threading overhead.

Thanks again,
Matt
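
A rough sketch of that kind of parallelism (the thread count, table and
keys are made up; an HTable instance is not thread-safe, so each worker
opens its own):

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.util.Bytes;

    public class ParallelRandomReads {
      public static void main(String[] args) throws Exception {
        final Configuration conf = HBaseConfiguration.create();
        ExecutorService pool = Executors.newFixedThreadPool(32);  // enough workers to keep disks seeking

        for (int t = 0; t < 32; t++) {
          final int worker = t;
          pool.submit(new Runnable() {
            public void run() {
              try {
                HTable table = new HTable(conf, "mytable");  // one HTable per thread (not thread-safe)
                for (int i = 0; i < 10000; i++) {
                  table.get(new Get(Bytes.toBytes("row-" + worker + "-" + i)));  // each get -> one pread
                }
                table.close();
              } catch (Exception e) {
                e.printStackTrace();
              }
            }
          });
        }
        pool.shutdown();
      }
    }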

On Tue, Oct 19, 2010 at 12:07 AM, Ryan Rawson <ry...@gmail.com> wrote:

> Hi,
>
> Since the file is write-once, no random writes, putting the index at
> the end is the only choice.  The loading goes like this:
> - read fixed file trailer, ie: filelen.offset - <fixed size>
> - read location of additional variable length sections, eg: block index
> - read those indexes, including the variable length 'file-info' section
>
>
> So to give some background, by default an InputStream from DFSClient
> has a single socket and positioned reads are fairly fast.  The DFS
> just sees 'read bytes from pos X length Y' commands on an open socket.
>  That is fast.  But only 1 thread at a time can use this interface.
> So for 'get' requests we use another interface called pread() which
> takes a position+length, and returns data.  This involves setting up a
> 1-use socket and tearing it down when we are done.  So it is slower by
> 2-3 tcp RTT, thread instantiation and other misc overhead.
>
>
> Back to the HFile index, it is indeed stored in ram, not block cache.
> Size is generally not an issue, hasn't been yet.  We ship with a
> default block size of 64k, and I'd recommend sticking with that unless
> you have evidential proof your data set performance is better under a
> different size.  Since the index grows linearly by a factor of 1/64k
> with the bytes of the data, it isn't a huge deal.  Also the indexes
> are spread around the cluster, so you are pushing load to all
> machines.
>
>
>
>
> On Mon, Oct 18, 2010 at 8:53 PM, Matt Corgan <mc...@hotpads.com> wrote:
> > Do you guys ever worry about how big an HFile's index will be?  For
> example,
> > if you have a 512mb HFile with 8k block size, you will have 64,000
> blocks.
> >  If each index entry is 50b, then you have a 3.2mb index which is way out
> of
> > line with your intention of having a small block size.  I believe that's
> > read all at once so will be slow the first time... is the index cached
> > somewhere (block cache?) so that index accesses are from memory?
> >
> > And somewhat related - since the index is stored at the end of the HFile,
> is
> > an additional random access required to find its offset?  If it was
> > considered, why was that chosen over putting it in it's own file that
> could
> > be accessed directly?
> >
> > Thanks for all these explanations,
> > Matt
> >
> >
> > On Mon, Oct 18, 2010 at 11:27 PM, Ryan Rawson <ry...@gmail.com>
> wrote:
> >
> >> The primary problem is the namenode memory. It contains entries for
> every
> >> file and block, so setting hdfs block size small limits your
> scaleability.
> >>
> >> There is nothing inherently wrong with in file random read, Its just
> That
> >> the hdfs client was written for a single reader to read most of a file.
> >> This to achieve high performance you'd need to do tricks, such as
> >> pipelining
> >> sockets and socket pool reuse. Right now for random reads We open a new
> >> socket, read data then close it.
> >> On Oct 18, 2010 8:22 PM, "William Kang" <we...@gmail.com> wrote:
> >> > Hi JG and Ryan,
> >> > Thanks for the excellent answers.
> >> >
> >> > So, I am going to push everything to the extremes without considering
> >> > the memory first. In theory, if in HBase, every cell size equals to
> >> > HBase block size, then there would not be any in block traverse. In
> >> > HDFS, very HBase block size equals to each HDFS block size, there
> >> > would not be any in-file random access necessary. This would provide
> >> > the best performance?
> >> >
> >> > But, the problem is that if the block in HBase is too large, the
> >> > memory will run out since HBase load block into memory; if the block
> >> > in HDFS is too small, the DN will run out of memory since each HDFS
> >> > file takes some memory. So, it is a trade-off problem between memory
> >> > and performance. Is it right?
> >> >
> >> > And would it make any difference between random reading the same size
> >> > file portion from of a small HDFS block and from a large HDFS block?
> >> >
> >> > Thanks.
> >> >
> >> >
> >> > William
> >> >
> >> > On Mon, Oct 18, 2010 at 10:58 PM, Ryan Rawson <ry...@gmail.com>
> >> wrote:
> >> >> On Mon, Oct 18, 2010 at 7:49 PM, William Kang <
> weliam.cloud@gmail.com>
> >> wrote:
> >> >>> Hi,
> >> >>> Recently I have spent some efforts to try to understand the
> mechanisms
> >> >>> of HBase to exploit possible performance tunning options. And many
> >> >>> thanks to the folks who helped with my questions in this community,
> I
> >> >>> have sent a report. But, there are still few questions left.
> >> >>>
> >> >>> 1. If a HFile block contains more than one keyvalue pair, will the
> >> >>> block index in HFile point out the offset for every keyvalue pair in
> >> >>> that block? Or, the block index will just point out the key ranges
> >> >>> inside that block, so you have to traverse inside the block until
> you
> >> >>> meet the key you are looking for?
> >> >>
> >> >> The block index contains the first key for every block.  It therefore
> >> >> defines in an [a,b) manner the range of each block. Once a block has
> >> >> been selected to read from, it is read into memory then iterated over
> >> >> until the key in question has been found (or the closest match has
> >> >> been found).
> >> >>
> >> >>> 2. When HBase read block to fetching the data or traverse in it, is
> >> >>> this block read into memory?
> >> >>
> >> >> yes, the entire block at a time is read in a single read operation.
> >> >>
> >> >>>
> >> >>> 3. HBase blocks (64k configurable) are inside HDFS blocks (64m
> >> >>> configurable), to read the HBase blocks, we have to random access
> the
> >> >>> HDFS blocks. Even HBase can use in(p, buf, 0, x) to read a small
> >> >>> portion of the larger HDFS blocks, it is still a random access.
> Would
> >> >>> this be slow?
> >> >>
> >> >> Random access reads are not necessarily slow, they require several
> >> things:
> >> >> - disk seeks to the data in question
> >> >> - disk seeks to the checksum files in question
> >> >> - checksum computation and verification
> >> >>
> >> >> While not particularly slow, this could probably be optimized a bit.
> >> >>
> >> >> Most of the issues with random reads in HDFS is parallelizing the
> >> >> reads and doing as much io-pushdown/scheduling as possible without
> >> >> consuming an excess of sockets and threads.  The actual speed can be
> >> >> excellent, or not, depending on how busy the IO subsystem is.
> >> >>
> >> >>
> >> >>>
> >> >>> Many thanks. I would be grateful for your answers.
> >> >>>
> >> >>>
> >> >>> William
> >> >>>
> >> >>
> >>
> >
>

Re: HBase random access in HDFS and block indices

Posted by Ryan Rawson <ry...@gmail.com>.
Hi,

Since the file is write-once (no random writes), putting the index at
the end is the only choice.  The loading goes like this:
- read the fixed file trailer, i.e. at offset filelen - <fixed size>
- read the locations of the additional variable-length sections, e.g. the block index
- read those indexes, including the variable-length 'file-info' section


So to give some background: by default an InputStream from DFSClient
has a single socket, and positioned reads are fairly fast.  The DFS
just sees 'read bytes from pos X, length Y' commands on an open socket.
That is fast, but only one thread at a time can use this interface.
So for 'get' requests we use another interface called pread(), which
takes a position + length and returns data.  This involves setting up a
one-use socket and tearing it down when we are done.  So it is slower by
2-3 TCP RTTs, thread instantiation and other misc overhead.


Back to the HFile index: it is indeed stored in RAM, not the block cache.
Size is generally not an issue; it hasn't been yet.  We ship with a
default block size of 64k, and I'd recommend sticking with that unless
you have evidence that your data set performs better under a
different size.  Since the index grows linearly, at a factor of about 1/64k
of the bytes of the data, it isn't a huge deal.  Also the indexes
are spread around the cluster, so you are pushing load to all
machines.
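
A rough sketch of that open sequence against the plain FileSystem API
(the trailer size and the trailer decoding below are placeholders, not
the real HFile format constants):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HFileOpenSketch {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path hfile = new Path("/hbase/mytable/region/family/somehfile");  // hypothetical path

        long fileLen = fs.getFileStatus(hfile).getLen();
        int trailerSize = 60;  // placeholder for the fixed trailer size
        byte[] trailer = new byte[trailerSize];

        FSDataInputStream in = fs.open(hfile);
        // 1. pread the fixed-size trailer at the end of the file.
        in.readFully(fileLen - trailerSize, trailer, 0, trailerSize);

        // 2. The trailer says where the block index and file-info sections start
        //    (decoding elided); pread those sections and keep them in memory.
        long blockIndexOffset = decodeIndexOffset(trailer);  // placeholder decoder
        int blockIndexLength = decodeIndexLength(trailer);   // placeholder decoder
        byte[] index = new byte[blockIndexLength];
        in.readFully(blockIndexOffset, index, 0, blockIndexLength);

        // From here on, every random get needs only one more pread: the data block itself.
        in.close();
      }

      private static long decodeIndexOffset(byte[] trailer) {
        return 0L;  // stub: real code would parse the trailer bytes
      }

      private static int decodeIndexLength(byte[] trailer) {
        return 0;   // stub: real code would parse the trailer bytes
      }
    }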




On Mon, Oct 18, 2010 at 8:53 PM, Matt Corgan <mc...@hotpads.com> wrote:
> Do you guys ever worry about how big an HFile's index will be?  For example,
> if you have a 512mb HFile with 8k block size, you will have 64,000 blocks.
>  If each index entry is 50b, then you have a 3.2mb index which is way out of
> line with your intention of having a small block size.  I believe that's
> read all at once so will be slow the first time... is the index cached
> somewhere (block cache?) so that index accesses are from memory?
>
> And somewhat related - since the index is stored at the end of the HFile, is
> an additional random access required to find its offset?  If it was
> considered, why was that chosen over putting it in it's own file that could
> be accessed directly?
>
> Thanks for all these explanations,
> Matt
>
>
> On Mon, Oct 18, 2010 at 11:27 PM, Ryan Rawson <ry...@gmail.com> wrote:
>
>> The primary problem is the namenode memory. It contains entries for every
>> file and block, so setting hdfs block size small limits your scaleability.
>>
>> There is nothing inherently wrong with in file random read, Its just That
>> the hdfs client was written for a single reader to read most of a file.
>> This to achieve high performance you'd need to do tricks, such as
>> pipelining
>> sockets and socket pool reuse. Right now for random reads We open a new
>> socket, read data then close it.
>> On Oct 18, 2010 8:22 PM, "William Kang" <we...@gmail.com> wrote:
>> > Hi JG and Ryan,
>> > Thanks for the excellent answers.
>> >
>> > So, I am going to push everything to the extremes without considering
>> > the memory first. In theory, if in HBase, every cell size equals to
>> > HBase block size, then there would not be any in block traverse. In
>> > HDFS, very HBase block size equals to each HDFS block size, there
>> > would not be any in-file random access necessary. This would provide
>> > the best performance?
>> >
>> > But, the problem is that if the block in HBase is too large, the
>> > memory will run out since HBase load block into memory; if the block
>> > in HDFS is too small, the DN will run out of memory since each HDFS
>> > file takes some memory. So, it is a trade-off problem between memory
>> > and performance. Is it right?
>> >
>> > And would it make any difference between random reading the same size
>> > file portion from of a small HDFS block and from a large HDFS block?
>> >
>> > Thanks.
>> >
>> >
>> > William
>> >
>> > On Mon, Oct 18, 2010 at 10:58 PM, Ryan Rawson <ry...@gmail.com>
>> wrote:
>> >> On Mon, Oct 18, 2010 at 7:49 PM, William Kang <we...@gmail.com>
>> wrote:
>> >>> Hi,
>> >>> Recently I have spent some efforts to try to understand the mechanisms
>> >>> of HBase to exploit possible performance tunning options. And many
>> >>> thanks to the folks who helped with my questions in this community, I
>> >>> have sent a report. But, there are still few questions left.
>> >>>
>> >>> 1. If a HFile block contains more than one keyvalue pair, will the
>> >>> block index in HFile point out the offset for every keyvalue pair in
>> >>> that block? Or, the block index will just point out the key ranges
>> >>> inside that block, so you have to traverse inside the block until you
>> >>> meet the key you are looking for?
>> >>
>> >> The block index contains the first key for every block.  It therefore
>> >> defines in an [a,b) manner the range of each block. Once a block has
>> >> been selected to read from, it is read into memory then iterated over
>> >> until the key in question has been found (or the closest match has
>> >> been found).
>> >>
>> >>> 2. When HBase read block to fetching the data or traverse in it, is
>> >>> this block read into memory?
>> >>
>> >> yes, the entire block at a time is read in a single read operation.
>> >>
>> >>>
>> >>> 3. HBase blocks (64k configurable) are inside HDFS blocks (64m
>> >>> configurable), to read the HBase blocks, we have to random access the
>> >>> HDFS blocks. Even HBase can use in(p, buf, 0, x) to read a small
>> >>> portion of the larger HDFS blocks, it is still a random access. Would
>> >>> this be slow?
>> >>
>> >> Random access reads are not necessarily slow, they require several
>> things:
>> >> - disk seeks to the data in question
>> >> - disk seeks to the checksum files in question
>> >> - checksum computation and verification
>> >>
>> >> While not particularly slow, this could probably be optimized a bit.
>> >>
>> >> Most of the issues with random reads in HDFS is parallelizing the
>> >> reads and doing as much io-pushdown/scheduling as possible without
>> >> consuming an excess of sockets and threads.  The actual speed can be
>> >> excellent, or not, depending on how busy the IO subsystem is.
>> >>
>> >>
>> >>>
>> >>> Many thanks. I would be grateful for your answers.
>> >>>
>> >>>
>> >>> William
>> >>>
>> >>
>>
>

Re: HBase random access in HDFS and block indices

Posted by Matt Corgan <mc...@hotpads.com>.
Do you guys ever worry about how big an HFile's index will be?  For example,
if you have a 512mb HFile with 8k block size, you will have 64,000 blocks.
 If each index entry is 50b, then you have a 3.2mb index which is way out of
line with your intention of having a small block size.  I believe that's
read all at once so will be slow the first time... is the index cached
somewhere (block cache?) so that index accesses are from memory?

And somewhat related - since the index is stored at the end of the HFile, is
an additional random access required to find its offset?  If it was
considered, why was that chosen over putting it in its own file that could
be accessed directly?

Thanks for all these explanations,
Matt


On Mon, Oct 18, 2010 at 11:27 PM, Ryan Rawson <ry...@gmail.com> wrote:

> The primary problem is the namenode memory. It contains entries for every
> file and block, so setting hdfs block size small limits your scaleability.
>
> There is nothing inherently wrong with in file random read, Its just That
> the hdfs client was written for a single reader to read most of a file.
> This to achieve high performance you'd need to do tricks, such as
> pipelining
> sockets and socket pool reuse. Right now for random reads We open a new
> socket, read data then close it.
> On Oct 18, 2010 8:22 PM, "William Kang" <we...@gmail.com> wrote:
> > Hi JG and Ryan,
> > Thanks for the excellent answers.
> >
> > So, I am going to push everything to the extremes without considering
> > the memory first. In theory, if in HBase, every cell size equals to
> > HBase block size, then there would not be any in block traverse. In
> > HDFS, very HBase block size equals to each HDFS block size, there
> > would not be any in-file random access necessary. This would provide
> > the best performance?
> >
> > But, the problem is that if the block in HBase is too large, the
> > memory will run out since HBase load block into memory; if the block
> > in HDFS is too small, the DN will run out of memory since each HDFS
> > file takes some memory. So, it is a trade-off problem between memory
> > and performance. Is it right?
> >
> > And would it make any difference between random reading the same size
> > file portion from of a small HDFS block and from a large HDFS block?
> >
> > Thanks.
> >
> >
> > William
> >
> > On Mon, Oct 18, 2010 at 10:58 PM, Ryan Rawson <ry...@gmail.com>
> wrote:
> >> On Mon, Oct 18, 2010 at 7:49 PM, William Kang <we...@gmail.com>
> wrote:
> >>> Hi,
> >>> Recently I have spent some efforts to try to understand the mechanisms
> >>> of HBase to exploit possible performance tunning options. And many
> >>> thanks to the folks who helped with my questions in this community, I
> >>> have sent a report. But, there are still few questions left.
> >>>
> >>> 1. If a HFile block contains more than one keyvalue pair, will the
> >>> block index in HFile point out the offset for every keyvalue pair in
> >>> that block? Or, the block index will just point out the key ranges
> >>> inside that block, so you have to traverse inside the block until you
> >>> meet the key you are looking for?
> >>
> >> The block index contains the first key for every block.  It therefore
> >> defines in an [a,b) manner the range of each block. Once a block has
> >> been selected to read from, it is read into memory then iterated over
> >> until the key in question has been found (or the closest match has
> >> been found).
> >>
> >>> 2. When HBase read block to fetching the data or traverse in it, is
> >>> this block read into memory?
> >>
> >> yes, the entire block at a time is read in a single read operation.
> >>
> >>>
> >>> 3. HBase blocks (64k configurable) are inside HDFS blocks (64m
> >>> configurable), to read the HBase blocks, we have to random access the
> >>> HDFS blocks. Even HBase can use in(p, buf, 0, x) to read a small
> >>> portion of the larger HDFS blocks, it is still a random access. Would
> >>> this be slow?
> >>
> >> Random access reads are not necessarily slow, they require several
> things:
> >> - disk seeks to the data in question
> >> - disk seeks to the checksum files in question
> >> - checksum computation and verification
> >>
> >> While not particularly slow, this could probably be optimized a bit.
> >>
> >> Most of the issues with random reads in HDFS is parallelizing the
> >> reads and doing as much io-pushdown/scheduling as possible without
> >> consuming an excess of sockets and threads.  The actual speed can be
> >> excellent, or not, depending on how busy the IO subsystem is.
> >>
> >>
> >>>
> >>> Many thanks. I would be grateful for your answers.
> >>>
> >>>
> >>> William
> >>>
> >>
>

Re: HBase random access in HDFS and block indices

Posted by Ryan Rawson <ry...@gmail.com>.
The primary problem is the namenode memory. It contains entries for every
file and block, so setting the HDFS block size small limits your scalability.

There is nothing inherently wrong with in-file random reads; it's just that
the HDFS client was written for a single reader to read most of a file.
Thus to achieve high performance you'd need to do tricks, such as pipelining
sockets and socket pool reuse. Right now, for random reads we open a new
socket, read the data, then close it.
On Oct 18, 2010 8:22 PM, "William Kang" <we...@gmail.com> wrote:
> Hi JG and Ryan,
> Thanks for the excellent answers.
>
> So, I am going to push everything to the extremes without considering
> the memory first. In theory, if in HBase, every cell size equals to
> HBase block size, then there would not be any in block traverse. In
> HDFS, very HBase block size equals to each HDFS block size, there
> would not be any in-file random access necessary. This would provide
> the best performance?
>
> But, the problem is that if the block in HBase is too large, the
> memory will run out since HBase load block into memory; if the block
> in HDFS is too small, the DN will run out of memory since each HDFS
> file takes some memory. So, it is a trade-off problem between memory
> and performance. Is it right?
>
> And would it make any difference between random reading the same size
> file portion from of a small HDFS block and from a large HDFS block?
>
> Thanks.
>
>
> William
>
> On Mon, Oct 18, 2010 at 10:58 PM, Ryan Rawson <ry...@gmail.com> wrote:
>> On Mon, Oct 18, 2010 at 7:49 PM, William Kang <we...@gmail.com>
wrote:
>>> Hi,
>>> Recently I have spent some efforts to try to understand the mechanisms
>>> of HBase to exploit possible performance tunning options. And many
>>> thanks to the folks who helped with my questions in this community, I
>>> have sent a report. But, there are still few questions left.
>>>
>>> 1. If a HFile block contains more than one keyvalue pair, will the
>>> block index in HFile point out the offset for every keyvalue pair in
>>> that block? Or, the block index will just point out the key ranges
>>> inside that block, so you have to traverse inside the block until you
>>> meet the key you are looking for?
>>
>> The block index contains the first key for every block.  It therefore
>> defines in an [a,b) manner the range of each block. Once a block has
>> been selected to read from, it is read into memory then iterated over
>> until the key in question has been found (or the closest match has
>> been found).
>>
>>> 2. When HBase read block to fetching the data or traverse in it, is
>>> this block read into memory?
>>
>> yes, the entire block at a time is read in a single read operation.
>>
>>>
>>> 3. HBase blocks (64k configurable) are inside HDFS blocks (64m
>>> configurable), to read the HBase blocks, we have to random access the
>>> HDFS blocks. Even HBase can use in(p, buf, 0, x) to read a small
>>> portion of the larger HDFS blocks, it is still a random access. Would
>>> this be slow?
>>
>> Random access reads are not necessarily slow, they require several
things:
>> - disk seeks to the data in question
>> - disk seeks to the checksum files in question
>> - checksum computation and verification
>>
>> While not particularly slow, this could probably be optimized a bit.
>>
>> Most of the issues with random reads in HDFS is parallelizing the
>> reads and doing as much io-pushdown/scheduling as possible without
>> consuming an excess of sockets and threads.  The actual speed can be
>> excellent, or not, depending on how busy the IO subsystem is.
>>
>>
>>>
>>> Many thanks. I would be grateful for your answers.
>>>
>>>
>>> William
>>>
>>

Re: HBase random access in HDFS and block indices

Posted by William Kang <we...@gmail.com>.
Hi JG and Ryan,
Thanks for the excellent answers.

So, I am going to push everything to the extremes without considering
memory first. In theory, if in HBase every cell size equals the HBase
block size, then there would not be any in-block traversal. And if
every HBase block size equals the HDFS block size, there would not be
any in-file random access necessary. Would this provide the best
performance?

But the problem is that if the HBase block is too large, memory will
run out since HBase loads blocks into memory; and if the HDFS block is
too small, the DN will run out of memory since each HDFS file takes
some memory. So it is a trade-off between memory and performance. Is
that right?

And would it make any difference between random reading the same-size
file portion from a small HDFS block versus from a large HDFS block?

Thanks.


William

On Mon, Oct 18, 2010 at 10:58 PM, Ryan Rawson <ry...@gmail.com> wrote:
> On Mon, Oct 18, 2010 at 7:49 PM, William Kang <we...@gmail.com> wrote:
>> Hi,
>> Recently I have spent some efforts to try to understand the mechanisms
>> of HBase to exploit possible performance tunning options. And many
>> thanks to the folks who helped with my questions in this community, I
>> have sent a report. But, there are still few questions left.
>>
>> 1. If a HFile block contains more than one keyvalue pair, will the
>> block index in HFile point out the offset for every keyvalue pair in
>> that block? Or, the block index will just point out the key ranges
>> inside that block, so you have to traverse inside the block until you
>> meet the key you are looking for?
>
> The block index contains the first key for every block.  It therefore
> defines in an [a,b) manner the range of each block. Once a block has
> been selected to read from, it is read into memory then iterated over
> until the key in question has been found (or the closest match has
> been found).
>
>> 2. When HBase read block to fetching the data or traverse in it, is
>> this block read into memory?
>
> yes, the entire block at a time is read in a single read operation.
>
>>
>> 3. HBase blocks (64k configurable) are inside HDFS blocks (64m
>> configurable), to read the HBase blocks, we have to random access the
>> HDFS blocks. Even HBase can use in(p, buf, 0, x) to read a small
>> portion of the larger HDFS blocks, it is still a random access. Would
>> this be slow?
>
> Random access reads are not necessarily slow, they require several things:
> - disk seeks to the data in question
> - disk seeks to the checksum files in question
> - checksum computation and verification
>
> While not particularly slow, this could probably be optimized a bit.
>
> Most of the issues with random reads in HDFS is parallelizing the
> reads and doing as much io-pushdown/scheduling as possible without
> consuming an excess of sockets and threads.  The actual speed can be
> excellent, or not, depending on how busy the IO subsystem is.
>
>
>>
>> Many thanks. I would be grateful for your answers.
>>
>>
>> William
>>
>

Re: HBase random access in HDFS and block indices

Posted by Ryan Rawson <ry...@gmail.com>.
On Mon, Oct 18, 2010 at 7:49 PM, William Kang <we...@gmail.com> wrote:
> Hi,
> Recently I have spent some efforts to try to understand the mechanisms
> of HBase to exploit possible performance tunning options. And many
> thanks to the folks who helped with my questions in this community, I
> have sent a report. But, there are still few questions left.
>
> 1. If a HFile block contains more than one keyvalue pair, will the
> block index in HFile point out the offset for every keyvalue pair in
> that block? Or, the block index will just point out the key ranges
> inside that block, so you have to traverse inside the block until you
> meet the key you are looking for?

The block index contains the first key of every block.  It therefore
defines the range of each block in an [a,b) manner. Once a block has
been selected to read from, it is read into memory and then iterated
over until the key in question has been found (or the closest match
has been found).
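
A simplified picture of that two-step lookup, treating keys as plain
strings and the block index as a sorted list of first keys (just the
idea, not HBase's actual classes):

import java.util.Collections;
import java.util.List;

public class BlockLookupSketch {
  // Pick the block whose [firstKey, nextFirstKey) range could hold `key`.
  static int findBlock(List<String> blockFirstKeys, String key) {
    int i = Collections.binarySearch(blockFirstKeys, key);
    if (i >= 0) return i;               // key is exactly a block's first key
    int insertion = -i - 1;             // index of first key greater than `key`
    return Math.max(0, insertion - 1);  // block starting at or before `key`
  }

  // Inside the chosen block, scan key/values until we reach or pass `key`.
  static String scanBlock(List<String[]> blockKvs, String key) {
    for (String[] kv : blockKvs) {
      int cmp = kv[0].compareTo(key);
      if (cmp == 0) return kv[1];       // exact match
      if (cmp > 0) break;               // passed it; a closest-match scan stops here
    }
    return null;
  }
}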

> 2. When HBase read block to fetching the data or traverse in it, is
> this block read into memory?

Yes, the entire block is read in a single read operation.

>
> 3. HBase blocks (64k configurable) are inside HDFS blocks (64m
> configurable), to read the HBase blocks, we have to random access the
> HDFS blocks. Even HBase can use in(p, buf, 0, x) to read a small
> portion of the larger HDFS blocks, it is still a random access. Would
> this be slow?

Random access reads are not necessarily slow; they require several things:
- disk seeks to the data in question
- disk seeks to the checksum files in question
- checksum computation and verification

While not particularly slow, this could probably be optimized a bit.

Most of the issue with random reads in HDFS is parallelizing the
reads and doing as much IO pushdown/scheduling as possible without
consuming an excess of sockets and threads.  The actual speed can be
excellent, or not, depending on how busy the IO subsystem is.


>
> Many thanks. I would be grateful for your answers.
>
>
> William
>

Fwd: HBase random access in HDFS and block indices

Posted by William Kang <we...@gmail.com>.
Hi,
Recently I have spent some effort trying to understand the mechanisms
of HBase to exploit possible performance tuning options. Many thanks
to the folks who helped with my questions in this community; I have
sent a report. But there are still a few questions left.

1. If an HFile block contains more than one key-value pair, will the
block index in the HFile point out the offset of every key-value pair
in that block? Or will the block index just point out the key ranges
inside that block, so you have to traverse inside the block until you
meet the key you are looking for?

2. When HBase reads a block to fetch data or traverse it, is this
block read into memory?

3. HBase blocks (64k, configurable) are inside HDFS blocks (64m,
configurable), so to read the HBase blocks we have to randomly access
the HDFS blocks. Even though HBase can use in(p, buf, 0, x) to read a
small portion of the larger HDFS block, it is still a random access.
Would this be slow?

Many thanks. I would be grateful for your answers.


William

RE: HBase random access in HDFS and block indices

Posted by Jonathan Gray <jg...@facebook.com>.
Hi William.  Answers inline.

> -----Original Message-----
> From: William Kang [mailto:weliam.cloud@gmail.com]
> Sent: Monday, October 18, 2010 7:48 PM
> To: hbase-user
> Subject: HBase random access in HDFS and block indices
> 
> Hi,
> Recently I have spent some efforts to try to understand the mechanisms
> of HBase to exploit possible performance tunning options. And many
> thanks to the folks who helped with my questions in this community, I
> have sent a report. But, there are still few questions left.
> 
> 1. If a HFile block contains more than one keyvalue pair, will the
> block index in HFile point out the offset for every keyvalue pair in
> that block? Or, the block index will just point out the key ranges
> inside that block, so you have to traverse inside the block until you
> meet the key you are looking for?

It is the latter.  The block index points to the start key of each block, so you effectively have a range for each block.

Lots of work has gone in recently to seek/reseek/early-out when possible and skip unnecessary blocks.

> 2. When HBase read block to fetching the data or traverse in it, is
> this block read into memory?

Yes.  And if the block cache is turned on, it will be put into an LRU cache.
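
As a toy illustration of the LRU idea (not the actual block cache
implementation; keying blocks by a string name is an assumption for
the example):

import java.util.LinkedHashMap;
import java.util.Map;

public class ToyBlockCache extends LinkedHashMap<String, byte[]> {
  private final int maxBlocks;

  public ToyBlockCache(int maxBlocks) {
    super(16, 0.75f, true);   // accessOrder = true gives LRU iteration order
    this.maxBlocks = maxBlocks;
  }

  @Override
  protected boolean removeEldestEntry(Map.Entry<String, byte[]> eldest) {
    return size() > maxBlocks;  // evict the least recently used block
  }
}

A read would check such a cache by block key (e.g. hfile name plus
block offset) first, and only go to HDFS on a miss.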

> 3. HBase blocks (64k configurable) are inside HDFS blocks (64m
> configurable), to read the HBase blocks, we have to random access the
> HDFS blocks. Even HBase can use in(p, buf, 0, x) to read a small
> portion of the larger HDFS blocks, it is still a random access. Would
> this be slow?

Yes, this is still random access.  HBase provides the indexing/retrieval/etc on top of HDFS to make the random read access as efficient as possible (and with caching) and makes random writes possible.

JG

> 
> Many thanks. I would be grateful for your answers.
> 
> 
> William