Posted to user@hbase.apache.org by Praveen Sripati <pr...@gmail.com> on 2012/01/19 12:34:27 UTC

Regarding data storage in HBase

Hi,

According to `Hadoop - The Definitive Guide`:

Writes arriving at a regionserver are first appended to a commit log and
then are added to an in-memory memstore. When a memstore fills, its content
is flushed to the filesystem.
The commit log is hosted on HDFS, so it remains available through a
regionserver crash.

A couple of questions:

1. When the memstore fills, is it flushed to HDFS or local file system?

2. If the region size (hbase.hregion.max.filesize) is set to 200MB and the
HDFS block size is set to 64MB, will the region be split across 4 data
nodes? I know it doesn't make sense to split a single region's data
across nodes in HDFS, but how is it handled in HBase?

3. Is the region size (hbase.hregion.max.filesize) the size of the commit log
or the size of the file that has been flushed?

4. The commit log might become big over time. Is there a concept similar to
the HDFS checkpoint in HBase for the commit logs?

I am familiar with HDFS and am trying to map its concepts to HBase.

Regards,
Praveen

Re: Regarding data storage in HBase

Posted by Stack <st...@duboce.net>.
On Thu, Jan 19, 2012 at 3:34 AM, Praveen Sripati
<pr...@gmail.com> wrote:

> 1. When the memstore fills, is it flushed to HDFS or local file system?
>
>
HDFS



> 2. If the region size (hbase.hregion.max.filesize) is set to 200MB and the
> HDFS Block Size is set to 64MB, will the region be split across 4 data
> nodes? I know that this doesn't make sense to split a single regions data
> across nodes in HDFS, but how is it handled in HBase?
>
>
Do you mean file rather than region in the above?

If so, yes, the file will be made of multiple HDFS blocks.  The blocks will
be replicated.  Usually one replica is on the datanode local to the
regionserver.  See the reference guide for more on HBase locality.
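
To make the arithmetic concrete, here is a minimal sketch (plain Python, not HBase or HDFS code). It assumes a single hypothetical 200MB file and the 64MB block size from the question; the function name is made up for illustration. A file occupies ceil(size / block size) blocks, and the last block only holds the remainder.

```python
MB = 1024 * 1024

def hdfs_block_sizes(file_size, block_size):
    """Return the sizes (in bytes) of the blocks a file of file_size bytes occupies.

    Illustrative helper only; this is not an HDFS API.
    """
    if file_size < 0 or block_size <= 0:
        raise ValueError("file_size must be >= 0 and block_size must be > 0")
    full_blocks, remainder = divmod(file_size, block_size)
    sizes = [block_size] * full_blocks
    if remainder:
        # The final block is only as large as the leftover data.
        sizes.append(remainder)
    return sizes

blocks = hdfs_block_sizes(200 * MB, 64 * MB)
print(len(blocks))                 # 4
print([b // MB for b in blocks])   # [64, 64, 64, 8]
```

Each of those blocks is replicated independently by HDFS, so the blocks of one file may well live on different datanodes.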




> 3. Is region size (hbase.hregion.max.filesize) the size of commit log or
> the size of the file that has been flushed?
>
>
It's about the files under a region.  WALs/logs have their own configs.



> 4. The commit log might become big over time, is there similar concept of
> checkpoint in HBase for the commit logs?
>
>
WALs are rolled at a configurable size, usually 64MB.  WALs whose edits
have all been flushed to HFiles are let go (deleted).
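
The roll-and-delete behaviour described above can be sketched with a toy model. This is a simplification under loudly stated assumptions: one region, sizes counted in number of edits rather than bytes, and every class, method, and parameter name below is hypothetical. It is not how the HBase code is structured; it only shows why a rolled WAL becomes deletable once a memstore flush has persisted all of its edits.

```python
class ToyRegionServer:
    """Toy single-region model of the write path (hypothetical, not HBase code)."""

    def __init__(self, wal_roll_size=3, memstore_flush_size=4):
        self.wal_roll_size = wal_roll_size              # edits per WAL before rolling
        self.memstore_flush_size = memstore_flush_size  # rows held before flushing
        self.seq = 0            # sequence number of the latest edit
        self.current_wal = []   # sequence numbers in the active WAL
        self.rolled_wals = []   # closed WALs, oldest first
        self.memstore = {}      # row -> value, not yet flushed
        self.store_files = []   # flushed snapshots, standing in for HFiles
        self.flushed_seq = 0    # highest sequence number persisted in a store file

    def put(self, row, value):
        self.seq += 1
        self.current_wal.append(self.seq)   # 1. append the edit to the WAL
        self.memstore[row] = value          # 2. then add it to the memstore
        if len(self.current_wal) >= self.wal_roll_size:
            self.roll_wal()
        if len(self.memstore) >= self.memstore_flush_size:
            self.flush()

    def roll_wal(self):
        # Close the active WAL and start a new, empty one.
        self.rolled_wals.append(self.current_wal)
        self.current_wal = []
        self.clean_old_wals()

    def flush(self):
        # Persist the memstore as a new store file and remember how far we got.
        self.store_files.append(dict(self.memstore))
        self.memstore.clear()
        self.flushed_seq = self.seq
        self.clean_old_wals()

    def clean_old_wals(self):
        # A rolled WAL can go once its newest edit has been flushed.
        self.rolled_wals = [w for w in self.rolled_wals if w[-1] > self.flushed_seq]


rs = ToyRegionServer()
for i in range(1, 8):
    rs.put("row%d" % i, "v")
print(len(rs.rolled_wals), len(rs.store_files))  # 1 1  (WAL [4, 5, 6] still needed)
rs.put("row8", "v")                              # fills the memstore, triggers a flush
print(len(rs.rolled_wals), len(rs.store_files))  # 0 2  (old WAL deleted after flush)
```

The point of the sketch is that WAL cleanup is driven by memstore flushes, not by a separate checkpoint step: nothing is copied out of the WAL, the WAL is simply discarded once the same edits are safely in store files.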

St.Ack

Re: Regarding data storage in HBase

Posted by Praveen Sripati <pr...@gmail.com>.
Thanks for the response.

> 4. The commit log might become big over time, is there similar concept of
> checkpoint in HBase for the commit logs?
>
> WALs are rolled at configurable size -- usually 64MB. WALs that have edits
> that have been all flushed to hfiles are let go/deleted.

1) Are WALs flushed to HFiles periodically, or only in the case of a
regionserver crash? I ask because the WALs may grow over time. In HDFS the
checkpoint is done when the edit log size reaches
'dfs.namenode.checkpoint.size' or after every
'dfs.namenode.checkpoint.period' seconds.

2) I went through the 'HBase Architecture 101 - Storage' blog entry (1),
written 2 years ago, and found it very useful. Is it still relevant?

Praveen

(1) - http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html

On Thu, Jan 19, 2012 at 11:42 PM, Doug Meil
<do...@explorysmedical.com> wrote:

>
> Hi there-
>
> re: #1
>
> HDFS
>
> See http://hbase.apache.org/book.html#regions.arch
>
> Also see http://hbase.apache.org/book.html#trouble.namenode.hbase.objects
> for what the directory structure looks like in HDFS.
>
>
> Re #2:
>
> Flushes are written as StoreFiles in HDFS.
>
> See http://hbase.apache.org/book.html#regions.arch
>
> Also see the section on "Region-RegionServer Locality"
>
> re: #3
>
> Flushed files, the total size of StoreFiles per region.
>
>
> See http://hbase.apache.org/book.html#regions.arch
>
> #4.  Not entirely sure about what you are asking, but see the WAL section
> in the Regions section.
>
>
>
>

Re: Regarding data storage in HBase

Posted by Doug Meil <do...@explorysmedical.com>.
Hi there-

re: #1 

HDFS

See http://hbase.apache.org/book.html#regions.arch

Also see http://hbase.apache.org/book.html#trouble.namenode.hbase.objects
for what the directory structure looks like in HDFS.


Re #2:

Flushes are written as StoreFiles in HDFS.

See http://hbase.apache.org/book.html#regions.arch

Also see the section on "Region-RegionServer Locality"

re: #3

Flushed files, the total size of StoreFiles per region.


See http://hbase.apache.org/book.html#regions.arch

#4.  Not entirely sure about what you are asking, but see the WAL section
in the Regions section.



