Posted to mapreduce-user@hadoop.apache.org by "Clements, Michael" <Mi...@disney.com> on 2010/01/08 00:09:40 UTC

load balanced task distribution

We have built a Hadoop prototype that uses Map-Reduce on HDFS &
HBase to analyze web traffic data. It is deployed on a 4-node cluster
of commodity 4-core machines running Linux. It works, but we are
getting pitiful performance.

The main problem is the way the Hadoop JobTracker allocates tasks to
machines. One simply configures a maximum number of tasks per machine,
and Hadoop blindly gives each machine tasks until it reaches that
limit. The problem is that we have to configure these limits
artificially low to prevent the machine from exploding when it has to
do other work.
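
For anyone unfamiliar, these are the static per-machine caps in
question, set in mapred-site.xml (the values here are just examples):

  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>4</value>
  </property>
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>2</value>
  </property>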

For example, consider a Map-Red job that loads data into HBase. As the
HBase table grows, it splits new regions onto other servers. These are
the same servers that are running Map-Red tasks. So whatever server
happens to be hosting the region that all the Map-Red tasks are
writing to gets overloaded, the HBase process becomes slow to respond,
and the Map-Red job fails.

Another example is that not all tasks are created equal. Sometimes 10
tasks per machine is ideal, sometimes only 2. It is tedious and clumsy
to configure each job individually for how many tasks it should have per
server.

The only way we've found around this is to configure the Map-Red task
limits so low that each machine can still handle any extra load that
comes its way (for example, becoming an HBase region server). But this
means all but one of the machines in the cluster are mostly idle (just
in case they become the HBase region server), which gives poor overall
performance. I would expect simple load-based task distribution to run
3-10 times faster, since Hadoop is currently using only a small
fraction of the available processing power in each machine.

We could separate HBase onto its own servers instead of sharing
machines with Map-Red tasks. But at best this only reduces the
magnitude of the problem, since it is still a static allocation that
wastes machines with idle CPU.

Load-based task distribution would automatically compensate for this
and adjust dynamically as the job runs. Imagine if the Hadoop
JobTracker allocated tasks to machines based on each machine's load
average. Instead of configuring a maximum task count in
mapred-site.xml, we'd like to configure a target load average (for
example, 2.0). This would give dynamic task balancing that
automatically compensates for different tasks, different hardware,
etc.
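
Concretely, imagine a mapred-site.xml entry like this (the property
name is invented for illustration; nothing like it exists today):

  <property>
    <!-- hypothetical: target per-machine load average to aim for when
         assigning tasks, instead of a fixed task cap -->
    <name>mapred.tasktracker.target.load.average</name>
    <value>2.0</value>
  </property>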

As the hot HBase region moved from machine A to machine B, B's load
average would increase and the JobTracker would stop giving it tasks.
Meanwhile, A's load average would drop and Hadoop would give it more
tasks. Each machine in the cluster would run steadily at the targeted
load average, which gives maximum throughput.

I don't believe the "Fair Scheduler" would solve this problem, because
it shares jobs across the cluster, not the tasks of each job. My
understanding is that it still uses the same static approach to
allocating the tasks.

But within the FairScheduler source code I found a "LoadManager"
interface. It has one implementation, CapBasedLoadManager, which
allocates tasks based on max counts.

We could implement our own LoadManager that uses the system load
average [for example, by checking
java.lang.management.OperatingSystemMXBean.getSystemLoadAverage()].
But this is specific to the FairScheduler. Ideally, such a task
balancer could be used with ANY job scheduler, not just the fair
scheduler.
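
Something like this is what I have in mind. It's only a sketch: the
method signatures and imports are approximated from what
CapBasedLoadManager appears to override, so treat it as pseudocode
against the fair scheduler contrib rather than compiling code:

  import java.lang.management.ManagementFactory;
  import org.apache.hadoop.mapred.TaskTrackerStatus; // approximated

  public class LoadAverageLoadManager extends LoadManager {

    // Target load average per machine; hard-coded here, but it would
    // come from configuration (see the hypothetical property above).
    private static final double TARGET_LOAD = 2.0;

    @Override
    public boolean canAssignMap(TaskTrackerStatus tracker,
        int totalRunnableMaps, int totalMapSlots) {
      return belowTargetLoad();
    }

    @Override
    public boolean canAssignReduce(TaskTrackerStatus tracker,
        int totalRunnableReduces, int totalReduceSlots) {
      return belowTargetLoad();
    }

    // One big catch: getSystemLoadAverage() reports the load of the
    // host this JVM runs on, which is the JobTracker, not the
    // TaskTracker being asked about. A real implementation would need
    // each tracker to report its own load average back in heartbeats.
    private boolean belowTargetLoad() {
      double load = ManagementFactory.getOperatingSystemMXBean()
          .getSystemLoadAverage(); // negative if unavailable
      return load >= 0 && load < TARGET_LOAD;
    }
  }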

That is, one shouldn't need to use the fair scheduler just to get load
average based task distribution for each job. This should be done
independently, regardless of how jobs are prioritized in the cluster.

Does the community have a good solution to this problem? Does the
LoadManager approach make sense? How could it be applied more generally
(not just within FairScheduler)?

Thanks
Michael Clements
Solutions Architect
michael.clements@disney.com
206 664-4374 office
360 317 5051 mobile


RE: load balanced task distribution

Posted by "Clements, Michael" <Mi...@disney.com>.
Matei, thanks for the info and tips.

To cut to the chase:
Does the community have a simple, easy means to raise the priority of
the HDFS & HBase processes above the Map-Reduce (JobTracker &
TaskTracker) processes when Hadoop is started? Some kind of config
setting for "nice" levels? A script we can edit? I looked through the
docs and didn't find anything, so I'm setting the CPU priorities
manually, on every cluster machine, after I start Hadoop. This is
tedious, but it works for now.
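
In case it helps, here is roughly what that manual step looks like as
a loop (a hypothetical script, not something shipped with Hadoop; the
daemon names are the ones listed in the details below):

  # run as root on each node, after the daemons are up
  for proc in NameNode DataNode HMaster HRegionServer HQuorumPeer; do
    pid=$(jps | awk -v p="$proc" '$2 == p { print $1 }')
    [ -n "$pid" ] && renice -n -10 -p "$pid"
  done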

The details:
Today I ran through several different configurations - separating out
HBase region servers, adjusting task counts, and changing the CPU
priorities of the Hadoop processes ("nice" levels). I found the best
results as follows:

1. Configure the cluster symmetrically: every node runs Map-Reduce
tasks and is an HBase region server (i.e. don't set aside machines
dedicated exclusively to HBase).

2. Set the number of tasks per machine to some reasonable value (for
our 4-CPU machines, 4-8 tasks work well). The ideal will vary by job,
so pick some middling value.

3. Elevate the CPU priorities (nice -10) of all the HDFS & HBase
processes. If you do a ps grep, these are the processes running
HMaster, NameNode, DataNode, HQuorumPeer, and HRegionServer (the loop
sketched above automates this). This is absolutely essential -
otherwise, Map-Red tasks will starve the HBase region server and the
entire job will fail.

4. Leave the Map-Red processes at normal priority: JobTracker,
TaskTracker. This lets the HBase region server starve them, if
necessary.

5. Ensure the HBase table is big enough, and split up enough, that
every node is serving regions. Even if your tasks are all assigned
different HBase regions, if those regions all live on the same region
server, that server will throttle your tasks.
 
With this configuration I've about doubled performance, and the
overall CPU usage across the machines is more or less full and steady,
with load averages ranging from 2 to 6 (50% to 150% of the number of
CPUs). It's not perfect, but it's a lot better. If a machine gets
overloaded, the Map-Red tasks don't starve the HBase region server, so
everything keeps running.

I think a load average based distribution would be even better, but this
is good enough for now.


Re: load balanced task distribution

Posted by Matei Zaharia <ma...@eecs.berkeley.edu>.
Hi Michael,

The Fair Scheduler's LoadManager was indeed put in place to allow for resource-aware scheduling in the future. Actually, Scott Chen from Facebook is currently working towards this feature. His latest patch related to it is https://issues.apache.org/jira/browse/MAPREDUCE-1218, which puts in place a framework to collect resource utilization information on TaskTrackers that can later be used by a LoadManager.

Unfortunately, I'm not sure that something like LoadManager will be standardized across schedulers soon, partly because people are still experimenting with how to build schedulers and want to maintain low-level control. However, standardizing may not be needed because the feature differences between schedulers are decreasing. For example, in Hadoop 0.21, the Fair Scheduler will support FIFO + priorities for jobs in the same pool, allowing it to emulate the default FIFO scheduler if desired (or allowing you to have 2 pools with different weights and FIFO inside each one, etc). Hopefully this is enough flexibility to be useful. Of course, if one interface proves successful, other schedulers might also adopt it.

A load-aware scheduler should solve some of your utilization problems, but if you want to better protect HBase from being smothered by MapReduce tasks, you might also want to use OS resource management facilities. At the very least, you could renice HBase to give it a higher CPU share. In newer Linux kernels, you can also try partitioning memory using Linux Containers, or if you're okay with using Solaris, you can use projects and zones there. I'm curious how other HBase users deal with this problem. My guess is that most people set task limits defensively to avoid oversubscription.

Matei
