Posted to user@hbase.apache.org by Tatsuya Kawano <ta...@gmail.com> on 2010/06/04 02:06:32 UTC

Adding a tiny HBase cluster to existing Hadoop environment

Hello,

I remember Jon mentioning the other day that he was trying a single HBase  
server on top of an existing HDFS cluster to serve MapReduce (MR) results. I  
wonder how that went.

A couple of friends in Tokyo are considering HBase for a similar  
purpose. They want to serve MR results inside their clients' companies via  
HBase. Both have existing MR/HDFS environments; one has a small cluster (<  
10 nodes) and the other a large one (> 50 nodes).

They'll use incremental bulk loading into an existing table (HBASE-1923) to  
add the MR results to the HBase table, and only a few users will read  
and export (web CSV download) the results via HBase. So HBase will be  
lightly loaded. They probably won't even need a high-availability (HA)  
setup for HBase.

So I'm thinking of recommending that they add just one server (non-HA) or  
two servers (HA) to their Hadoop cluster, and run only the HMaster and  
RegionServer processes on them. The HBase cluster will  
use the existing (small or large) HDFS cluster and ZooKeeper  
ensemble.
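
A minimal hbase-site.xml sketch for such an add-on cluster might look like this (host names are hypothetical; it points HBase at the existing HDFS and ZooKeeper instead of managing its own):

```xml
<configuration>
  <!-- Store HBase data on the existing HDFS cluster -->
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://namenode.example.com:9000/hbase</value>
  </property>
  <!-- Fully distributed mode, even with a single RegionServer -->
  <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
  </property>
  <!-- Reuse the existing ZooKeeper ensemble -->
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>zk1.example.com,zk2.example.com,zk3.example.com</value>
  </property>
</configuration>
```

You'd also set HBASE_MANAGES_ZK=false in hbase-env.sh so HBase doesn't try to start its own ZooKeeper.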

The server spec will be 2 x 8-core processors and 8 GB to 24 GB of RAM. The  
RAM size will vary depending on the data volume and access pattern.

Has anybody tried a similar configuration? How did it go?


Also, I saw Jon's slides from Hadoop World NYC 2009, which said it's  
better to have at least 5 RegionServers / DataNodes in the  
cluster to get typical performance. If I deploy the RSs and DNs on  
separate servers, which one should be >= 5 nodes? The DNs? The RSs? Or both?


Thanks,
Tatsuya Kawano
Tokyo, Japan




Re: Adding a tiny HBase cluster to existing Hadoop environment

Posted by Tatsuya Kawano <ta...@gmail.com>.
Hi Todd, 

Thanks for answering my question. 

> On Thu, Jun 3, 2010 at 5:06 PM, Tatsuya Kawano wrote:
>> I remember Jon was talking other day that he was trying a single HBase
>> server with existing HDFS cluster to serve map reduce (MR) results. I wonder
>> if this went well or not.


>> So I'm thinking to recommend them to add just one server (non-HA) or two
>> servers (HA) to their Hadoop cluster, and run only HMaster and Region Server
>> processes on the server(s). The HBase cluster will utilize the existing
>> (small or large) HDFS cluster and ZooKeeper ensemble.

I went back to the mailing list archive and found that the information I needed was already there; Jon wrote down the pros and cons of a similar configuration. 

RE: HBase on 1 box? how big?
http://markmail.org/thread/3yfoou4gna2fex5f#query:+page:1+mid:4m27ay3mwuh2a5vu+state:results


On 06/04/2010, at 9:37 AM, Todd Lipcon wrote:
> If your "exported dataset" from the MR job is small enough to fit on one
> server, you can certainly use a single HBase RS plus the bulk load
> functionality. However, with a small dataset like that it might make more
> sense to simply export TSV/CSV and then use a tool like Sqoop to export to a
> relational database. That way you'd have better off the shelf integration
> with various other tools or access methods.

Thanks for the suggestion. In this particular configuration, I'm expecting one RS to handle a far larger dataset than in a typical HBase configuration. The dataset is read-only, so all memstores will stay empty. That leaves more room in RAM, and the RS could host more regions than usual. Also, the RS is backed by the existing HDFS installation. The larger cluster has more than 50 DataNodes, which could give the RS better concurrent random-read capacity than a single-node RDBMS with local hard drives.  

I talked to the guys last night, and one of them is also evaluating RDBMSs (Sybase, Oracle, and MySQL). His current concern is that loading the large dataset into an RDBMS is time-consuming. He's going to try the native import utilities for those RDBMSs, and Sqoop is on his list too. (He attended Cloudera's Hadoop training in Tokyo.) But he also wants to try HBase as another option because it has better MR integration. 
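
For the RDBMS route Todd suggested, the Sqoop side could look roughly like this sketch. The connect string, table, and directory are hypothetical, and available flags differ between Sqoop versions:

```shell
# Push the tab-separated MR output from HDFS into a MySQL table:
sqoop export \
  --connect jdbc:mysql://db.example.com/results \
  --table mr_results \
  --export-dir /user/mr/output \
  --input-fields-terminated-by '\t'
```

That would let the read/export side stay on familiar RDBMS tooling, at the cost of the load time he's worried about.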


>> Also, I saw Jon's slides for Hadoop World in NYC 2009, and it was said that
>> I'd better to have at least 5 Region Servers / Data Nodes in my cluster to
>> get the typical performance. If I deploy RS and DN on separate servers,
>> which one should be >= 5 nodes? DN? RS? or both?
>> 
>> 
> Better to colocate the DNs and RSs for most deployments. You get
> significantly better random read performance for uncached data.

If I could build the cluster from scratch, I would suggest that. The difficult part in my case is that the current installations (50+ servers) were not sized to host RSs. I would need to add more processor cores and RAM to the current servers to make reliable TaskTracker + DN + RS nodes. Also, I clearly don't need an RS on all 50+ servers, so maybe five of them? But having only five RegionServers on 50+ DataNodes would leave the HDFS data blocks unevenly distributed across the cluster. That wouldn't be an optimal solution. 

So, in this particular case, I'd rather separate the RSs from the DNs to keep the data blocks evenly distributed. I'm not sure whether this hurts random-read performance much, because network latency on today's hardware is good enough (about 0.1 ms on average) compared to server-class 15,000 RPM hard drives (about 5 ms per seek). The only drawback I can think of is network congestion during massive writes and scans, but my case doesn't involve such operations. 
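
A back-of-the-envelope check on that latency argument, using the 0.1 ms and 5 ms figures above as the assumptions:

```shell
# Rough uncached random-read cost, in microseconds:
# a local read pays the disk seek; a remote read adds one network round trip.
disk_seek_us=5000   # ~5 ms seek on a 15,000 RPM drive
net_rtt_us=100      # ~0.1 ms LAN round trip
remote_read_us=$((disk_seek_us + net_rtt_us))
overhead_pct=$((100 * net_rtt_us / disk_seek_us))
echo "local=${disk_seek_us}us remote=${remote_read_us}us (+${overhead_pct}%)"
```

Under these assumptions the network hop adds only about 2% to an uncached read, which is why separating RSs from DNs may be tolerable for a light, read-mostly workload.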


It was good to learn that having fewer than five RegionServers is not a bad idea (as long as you have enough HDFS DataNodes). Your and Jon's emails gave me some pointers on things to avoid, and one of my friends is evaluating RDBMSs as well. 

Thanks, 
Tatsuya





Re: Adding a tiny HBase cluster to existing Hadoop environment

Posted by Todd Lipcon <to...@cloudera.com>.
Hi Tatsuya,

On Thu, Jun 3, 2010 at 5:06 PM, Tatsuya Kawano <ta...@gmail.com> wrote:

> Hello,
>
> I remember Jon was talking other day that he was trying a single HBase
> server with existing HDFS cluster to serve map reduce (MR) results. I wonder
> if this went well or not.
>
> A couple of friends in Tokyo are considering HBase to do a similar thing.
> They want to serve MR results inside the clients' companies via HBase. They
> both have existing MR/HDFS environment; one has a small (< 10) and another
> has a large (> 50) clusters.
>
> They'll use the incremental loading to existing table (HBASE-1923) to add
> the MR results to the HBase table, and only few users will read and export
> (web CSV download) the results via HBase. So HBase will be lightly loaded.
> They probably won't even need high availability (HA) option on HBase.
>
> So I'm thinking to recommend them to add just one server (non-HA) or two
> servers (HA) to their Hadoop cluster, and run only HMaster and Region Server
> processes on the server(s). The HBase cluster will utilize the existing
> (small or large) HDFS cluster and ZooKeeper ensemble.
>
>
If your "exported dataset" from the MR job is small enough to fit on one
server, you can certainly use a single HBase RS plus the bulk load
functionality. However, with a small dataset like that it might make more
sense to simply export TSV/CSV and then use a tool like Sqoop to export to a
relational database. That way you'd have better off-the-shelf integration
with various other tools or access methods.


> The server spec will be 2 x 8-core processors and 8GB to 24GB RAM. The RAM
> size will change depending on the data volume and access pattern.
>
> Has anybody tried a similar configuration? and how it goes?
>
>
> Also, I saw Jon's slides for Hadoop World in NYC 2009, and it was said that
> I'd better to have at least 5 Region Servers / Data Nodes in my cluster to
> get the typical performance. If I deploy RS and DN on separate servers,
> which one should be >= 5 nodes? DN? RS? or both?
>
>
Better to colocate the DNs and RSs for most deployments. You get
significantly better random read performance for uncached data.

-Todd


>
> Thanks,
> Tatsuya Kawano
> Tokyo, Japan
>
>
>
>


-- 
Todd Lipcon
Software Engineer, Cloudera