
Posted to user@hbase.apache.org by Sean Laurent <or...@gmail.com> on 2009/02/03 23:13:00 UTC

HBase and Hadoop MapReduce - Common setups?

Howdy folks,
We're evaluating HBase and we're trying to get a good solid picture of how
everything fits together... specifically, we're wondering how people
commonly set up HBase. I'm imagining you typically run the region servers on
the same machines as the HDFS data nodes to gain data locality benefits. And
from what I've seen on the mailing list, it's typically recommended
(although it sounds like it's up for debate in terms of SPoF issues) to run
separate machines for the HBaseMaster and NameNode servers.

Is it something along the following lines?

1x HBaseMaster
1x HDFS NameNode
N machines with both HRegionServer and DataNode

Now what about Hadoop and task trackers? Do people typically run completely
separate clusters for their M/R tasks? Do they run task trackers alongside
the region server and data nodes? Or add machines that run TaskTracker and
DataNode servers but ~not~ HRegionServer?

Any thoughts or opinions would be greatly appreciated!

-Sean

RE: HBase and Hadoop MapReduce - Common setups?

Posted by Jonathan Gray <jl...@streamy.com>.

Yes, you can (and will) have contention when sharing resources like that.

Most clusters are built on 4-core machines with 4GB of RAM (some slightly
worse, some slightly better) so there are sufficient resources to go around.

You'll need to limit the total number of maps/reduces allowed per node to
ensure that running tasks do not starve the DataNode or RegionServer.  The
limit would depend on the nature of your tasks.  If they are CPU-bound and
you have four cores, you would want no more than 2 (or 3 if you want to
push it) running on any given node.

JG
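
A rough sketch of the per-node limit described above, written in Python.
The two property names are the classic TaskTracker slot settings from
Hadoop MapReduce (MRv1). Reserving one core each for the DataNode and
RegionServer, and splitting the remaining slots between maps and reduces,
are assumptions made for illustration and are not taken from this thread.

```python
def task_slot_limits(cores, reserved_cores=2):
    """Return (map_slots, reduce_slots) for one worker node.

    reserved_cores is an assumption: by default one core is left free
    for the DataNode and one for the RegionServer.
    """
    if cores < 1:
        raise ValueError("cores must be at least 1")
    # Keep at least two slots so one map and one reduce can always run.
    total = max(2, cores - reserved_cores)
    map_slots = (total + 1) // 2      # maps get the extra slot if odd
    reduce_slots = total - map_slots
    return map_slots, reduce_slots


def as_properties(map_slots, reduce_slots):
    """Express the limits as TaskTracker configuration properties."""
    return {
        "mapred.tasktracker.map.tasks.maximum": map_slots,
        "mapred.tasktracker.reduce.tasks.maximum": reduce_slots,
    }


if __name__ == "__main__":
    # Four cores, CPU-bound tasks: two task slots in total.
    maps, reduces = task_slot_limits(cores=4)
    for name, value in as_properties(maps, reduces).items():
        print(f"{name}={value}")
```

Passing reserved_cores=1 corresponds to the "push it" case of three
concurrent tasks on a four-core node. For I/O-bound jobs the right numbers
may differ, so treat these values as a starting point to measure against.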

Re: HBase and Hadoop MapReduce - Common setups?

Posted by Sean Laurent <or...@gmail.com>.

Okay, that sounds like what I expected. But isn't there a strong likelihood
of competition for HDFS resources between an M/R task running on a
TaskTracker and the RegionServer running on the same machine?

In other words, let's say a Hadoop M/R task is running on a given
TaskTracker and it's actively reading data from HDFS via the DataNode (and
both are on the same machine for locality reasons). At the same time,
another client is running an HBase BatchUpdate that affects the data stored
on that very same DataNode. Won't that create a bottleneck? Or do the HBase
operations like BatchUpdate actually run as M/R tasks? Or am I
overestimating the data-retrieval problem?

Thanks!

-Sean

RE: HBase and Hadoop MapReduce - Common setups?

Posted by Jonathan Gray <jl...@streamy.com>.

Sean,

You're going to want to run your TaskTrackers local to your DataNodes and
RegionServers, again for locality reasons.  That's one of the primary
advantages of MapReduce, moving computation to data.

Otherwise, you are on track.  Of course the setup depends on what you're
doing, but what you describe matches a majority of the HBase setups I'm
aware of.

JG
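
To make the resulting layout concrete, here is a small Python sketch that
lists which daemons would run on which machine. The hostnames are
hypothetical, and the sketch covers only the daemons named in this thread.

```python
def cluster_layout(num_workers):
    """Map hypothetical hostnames to the daemons each one runs."""
    layout = {
        "master-hbase": ["HBaseMaster"],
        "master-hdfs": ["NameNode"],
    }
    for i in range(1, num_workers + 1):
        # Every worker co-locates storage, serving, and computation.
        layout[f"worker-{i:02d}"] = ["DataNode", "HRegionServer", "TaskTracker"]
    return layout


if __name__ == "__main__":
    for host, daemons in cluster_layout(3).items():
        print(f"{host}: {', '.join(daemons)}")
```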
