Posted to user@hbase.apache.org by Stuti Awasthi <st...@hcl.com> on 2011/11/16 05:39:17 UTC

Query on analyzing big data with HBase

Hi all,

I have a scenario in which my HBase tables will be fed with more than 250 GB of new data every day. I have to analyze that data using MR jobs and save the output back into an HBase table.

1.      My concern is: will HBase be able to handle such a volume of data, given that it is built for big data?

2.      What hardware / HBase configuration points must I keep in mind when building a cluster for such a requirement?

a.      How many region servers?

b.      How many regions per region server?

3.      My schema is such that one table with one column family (CF) will have millions of column qualifiers. What are the consequences of such a design?

Please suggest.

Regards,
Stuti Awasthi
HCL Comnet Systems and Services Ltd
F-8/9 Basement, Sec-3,Noida.



RE: Query on analyzing big data with HBase

Posted by Stuti Awasthi <st...@hcl.com>.
Hey Cosmin,

Thanks for the info. I will do more testing to identify the factors that can cause issues. :)



Re: Query on analyzing big data with HBase

Posted by Cosmin Lehene <cl...@adobe.com>.
You should consider looking over the available HBase resources. There's the
online book: http://hbase.apache.org/book.html

And there's Lars George's book from O'Reilly
(http://shop.oreilly.com/product/0636920014348.do)

On 11/16/11 6:39 AM, "Stuti Awasthi" <st...@hcl.com> wrote:

>Hi all,
>
>I have a scenario in which my HBase tables will be fed with more than
>250 GB of new data every day. I have to analyze that data using MR
>jobs and save the output back into an HBase table.
>
>1.      My concern is: will HBase be able to handle such a volume of
>data, given that it is built for big data?
Yes, it should be able to handle this amount of data. However, you need to
determine the number of simultaneous requests and the size of each request,
so that you can determine the minimum number of region servers and their
configuration.

You could do some testing on a small cluster once you decide on the hardware
you're going to use.

>
>2.      What hardware / HBase configuration points must I keep in mind
>when building a cluster for such a requirement?

It depends on the data access patterns: e.g. running a map-reduce job
incrementally on the new data vs. having the data available for lots of
random reads.
It also depends on the desired duration of the map-reduce job or the
average latency you want for the random reads.

Generally, depending on what you need, you'll have to tune a core x spindle
x RAM formula.
If you have too few disks you'll end up with an I/O bottleneck; if you add
too many you'll either saturate the CPU or the NIC and have some disks sit
idle.
I'm not sure a golden rule is what you should be relying on, but 1 core
x 1 spindle x 4 GB RAM is common, so you can use this as a baseline and
adapt.
Optimizing the map-reduce code will generally change things dramatically
:).
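
As a rough, purely illustrative sizing sketch (every number below is an
assumption for the example, not a measurement of your workload):

	new data per day:         250 GB
	raw HDFS footprint:       250 GB x 3 replicas = 750 GB/day
	90-day retention:         750 GB x 90 = ~67.5 TB raw
	usable disk per node:     12 spindles x 2 TB = 24 TB
	nodes for storage alone:  67.5 TB / 24 TB = ~3 nodes

Storage alone would be covered by a handful of nodes, but you would almost
certainly want more nodes than that to get enough cores and RAM (per the
1 core x 1 spindle x 4 GB baseline above) for the MR jobs, plus headroom
for compactions and node failures.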

You need to take bandwidth utilization into account as well: all data
written through the HBase API initially goes to
	a Write-Ahead Log (WAL) in HDFS, which is replicated to 3 machines
	(writing the WAL is optional), and
	the HRegionServer cache (RAM), which is flushed to HDFS as well
	(3 replicas).
One of the replicas will always be on the local machine (given that you
run the DataNode and HRegionServer on the same machines), but the other two
will go out to different machines.
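
If your MR-generated writes can simply be re-run after a failure, one way to
cut down that WAL traffic is to skip the WAL for those writes. A minimal
sketch against the 0.90-era client API; the table, family and qualifier
names are made up for illustration:

import java.io.IOException;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class NoWalPutExample {
  public static void main(String[] args) throws IOException {
    // "metrics", "cf" and "q1" are hypothetical names.
    HTable table = new HTable(HBaseConfiguration.create(), "metrics");
    Put put = new Put(Bytes.toBytes("row-0001"));
    put.add(Bytes.toBytes("cf"), Bytes.toBytes("q1"), Bytes.toBytes("value"));
    // Skip the WAL for this write: less HDFS traffic, but the edit is lost
    // if the region server crashes before the memstore is flushed.
    put.setWriteToWAL(false);
    table.put(put);
    table.close();
  }
}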


>
>a.      How many region servers?

This depends a lot on the data access pattern and on the hardware that the
HRegionServer runs on (how much RAM, how many cores, how many spindles).
Normally, if you don't access the old data much, then it's OK to keep it on
fewer region servers with more disk space, as it won't take up resources.

>
>b.      How many regions per region server?

There are some points on this in the books. By default the maximum region
size is 256 MB (hbase.hregion.max.filesize) and it's configurable to larger
sizes. Facebook has some interesting points on this as well.
There's a balance between avoiding region splits (if a region grows larger
than the defined size it will be split in two) and having a good data
distribution across the cluster (e.g. if you have one huge region and all
the writes go to it, you'll end up using a single region server for all
writes) - so you need to decide on a good key distribution.
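
A small sketch of how you might act on this when creating the table: raise
the maximum region size and pre-split on your expected key range so writes
spread across region servers from day one. This is only an illustration
against the older admin API; the table name, split keys and the 1 GB size
are assumptions, not recommendations:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.util.Bytes;

public class CreatePresplitTable {
  public static void main(String[] args) throws IOException {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);

    HTableDescriptor desc = new HTableDescriptor("metrics"); // hypothetical table name
    desc.setMaxFileSize(1024L * 1024L * 1024L);              // 1 GB max region size (example value)
    desc.addFamily(new HColumnDescriptor("cf"));             // the single column family

    // Pre-split into 4 regions; in practice derive the split points
    // from the distribution of your own row keys.
    byte[][] splits = new byte[][] {
        Bytes.toBytes("2"), Bytes.toBytes("5"), Bytes.toBytes("8")
    };
    admin.createTable(desc, splits);
  }
}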

>
>3.      My schema is such that one table with one column family (CF)
>will have millions of column qualifiers. What are the consequences of
>such a design?

It means that you need to make sure you're not exceeding the region size
with a single row (a row is never split across regions).
You also have to consider that getting an entire row will be an expensive
operation. You should look at the batching options for Scans
(incrementally retrieving batches of columns from a row).
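
A minimal sketch of what that looks like with the scan API; the table and
column family names are made up, and the batch/caching numbers are only
illustrative starting points:

import java.io.IOException;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class WideRowScanExample {
  public static void main(String[] args) throws IOException {
    HTable table = new HTable(HBaseConfiguration.create(), "metrics"); // hypothetical table
    Scan scan = new Scan();
    scan.addFamily(Bytes.toBytes("cf"));
    scan.setBatch(1000); // at most 1000 columns per Result, so a row with
                         // millions of qualifiers comes back in chunks
    scan.setCaching(10); // number of Results fetched per RPC
    ResultScanner scanner = table.getScanner(scan);
    try {
      for (Result partialRow : scanner) {
        // Consecutive Results may share the same row key when a row is
        // wider than the batch size.
        System.out.println(Bytes.toString(partialRow.getRow())
            + " -> " + partialRow.size() + " columns");
      }
    } finally {
      scanner.close();
      table.close();
    }
  }
}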

Again, testing is key :)


You could consider writing the MR job output to HFiles that you bulk load
into HBase afterwards, instead of going through the HBase API - especially
if the resulting data is large.
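
A rough sketch of that approach, assuming a mapper of your own (here called
AnalysisMapper, parsing a made-up TSV input) that emits
ImmutableBytesWritable/Put pairs. HFileOutputFormat.configureIncrementalLoad
wires up the reducer and partitioner so the HFiles match the target table's
regions, and the files are then loaded with the completebulkload tool:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class BulkLoadJob {

  // Hypothetical mapper: parses "rowkey<TAB>value" lines and emits a Put.
  public static class AnalysisMapper
      extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
    @Override
    protected void map(LongWritable key, Text line, Context ctx)
        throws IOException, InterruptedException {
      String[] fields = line.toString().split("\t");
      byte[] row = Bytes.toBytes(fields[0]);
      Put put = new Put(row);
      put.add(Bytes.toBytes("cf"), Bytes.toBytes("q1"), Bytes.toBytes(fields[1]));
      ctx.write(new ImmutableBytesWritable(row), put);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "analysis-bulkload");
    job.setJarByClass(BulkLoadJob.class);
    job.setMapperClass(AnalysisMapper.class);
    job.setMapOutputKeyClass(ImmutableBytesWritable.class);
    job.setMapOutputValueClass(Put.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));    // raw input
    HFileOutputFormat.setOutputPath(job, new Path(args[1])); // HFile output dir

    // Configures reducer, partitioner and output format to match the
    // current region boundaries of the target table.
    HTable table = new HTable(conf, "metrics");               // hypothetical table
    HFileOutputFormat.configureIncrementalLoad(job, table);

    boolean ok = job.waitForCompletion(true);
    // Afterwards, load the generated HFiles into the table, e.g. with:
    //   hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles <output dir> metrics
    System.exit(ok ? 0 : 1);
  }
}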


Cosmin


>
>Please suggest.
>
>Regards,
>Stuti Awasthi
>HCL Comnet Systems and Services Ltd
>F-8/9 Basement, Sec-3,Noida.