Posted to mapreduce-user@hadoop.apache.org by Alex Munteanu <al...@geraeteturnen.com> on 2010/06/03 10:45:26 UTC

Fwd: how to set max map tasks individually for each job?

Hello,

I am running several different mapreduce jobs. For some of them it is
better to have a rather high number of running map tasks per node,
whereas others do very intensive read operations on our database
resulting in read timeouts. So for these jobs I'd like to set a much
smaller limit of concurrently running map tasks.

I tried to override the "mapred.tasktracker.map.tasks.maximum" value in
our job setup, but it appears to be a global setting: it affects the
tasktrackers themselves, not the scheduling component.
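
For reference, the attempt looked roughly like this (a minimal sketch
against the old mapred API; MyDbJob and the elided job setup are
placeholders). It has no per-job effect because each tasktracker reads
this property from its own configuration when the daemon starts:

    import java.io.IOException;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    public class MyDbJob {
      public static void main(String[] args) throws IOException {
        JobConf conf = new JobConf(MyDbJob.class);
        // Ignored per-job: each tasktracker reads this value from its own
        // mapred-site.xml at startup to size its map slots.
        conf.setInt("mapred.tasktracker.map.tasks.maximum", 2);
        // ... mapper, input/output setup elided ...
        JobClient.runJob(conf);
      }
    }
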
I've also found https://issues.apache.org/jira/browse/HADOOP-5170 on the
web. It seems to be exactly what I need, but the changes do not appear to
be in the current 0.20.2 release, which I am using, and they also seem to
involve the JobConf class, which for now is marked deprecated.

So I have no idea how to do this without changing the global tasktracker
map task maximum value and restarting the system.

Alex


Re: how to set max map tasks individually for each job?

Posted by Eric Sammer <es...@cloudera.com>.
On Thu, Jun 3, 2010 at 4:45 AM, Alex Munteanu <al...@geraeteturnen.com> wrote:
> Hello,
>
> I am running several different mapreduce jobs. For some of them it is
> better to have a rather high number of running map tasks per node,
> whereas others do very intensive read operations on our database
> resulting in read timeouts. So for these jobs I'd like to set a much
> smaller limit of concurrently running map tasks.
>
> I tried to override the "mapred.tasktracker.map.tasks.maximum" value in
> our job setup, but it appears to be a global setting: it affects the
> tasktrackers themselves, not the scheduling component.

That's correct.

> I've also found https://issues.apache.org/jira/browse/HADOOP-5170 on the
> web. It seems to be exactly what I need, but the changes do not appear to
> be in the current 0.20.2 release, which I am using, and they also seem to
> involve the JobConf class, which for now is marked deprecated.

There are two parts here. Regarding HADOOP-5170, you can see that it
was strongly debated in the JIRA comments. The patch was backed out of
0.21 (the version it was scheduled to be part of), and the author opted
to submit it as part of the Fair Scheduler rather than Hadoop MR. I'm
not sure of the exact status of its inclusion in the fair scheduler
code base.

While JobConf (and many related mapred.* classes) is marked
@Deprecated, the reality is that it will probably be un-deprecated
for the next release. These classes will be around for a while.

> So I have no idea how to do this without changing the global tasktracker
> map task maximum value and restarting the system.

Unfortunately, there's no good way to handle this right now. You can
use the fair scheduler to create two pools with different max task
caps, but those caps are cluster-wide, not per host, so I don't think
that will be helpful. A better option, if possible, is to pack more
work into each task of the "lighter" of your two jobs so that the two
have similar performance characteristics. Of course, easier said than
done, I know.
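
For reference, a pool setup along those lines might look like the
following allocation file (a sketch: the pool names are made up, and
support for the maxMaps tag depends on the fair scheduler version you
run; note the caps apply across the whole cluster):

    <?xml version="1.0"?>
    <!-- Hypothetical allocations.xml with two pools. -->
    <allocations>
      <pool name="db-heavy">
        <maxMaps>10</maxMaps>   <!-- at most 10 concurrent maps, cluster-wide -->
      </pool>
      <pool name="normal">
        <maxMaps>200</maxMaps>
      </pool>
    </allocations>

Jobs land in a pool via the property named by
mapred.fairscheduler.poolnameproperty (the submitting user's name by
default), so the database-heavy jobs would need to set that property
accordingly.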

-- 
Eric Sammer
phone: +1-917-287-2675
twitter: esammer
data: www.cloudera.com

Re: how to set max map tasks individually for each job?

Posted by Allen Wittenauer <aw...@linkedin.com>.
On Jun 3, 2010, at 1:45 AM, Alex Munteanu wrote:
> I am running several different mapreduce jobs. For some of them it is
> better to have a rather high number of running map tasks per node,
> whereas others do very intensive read operations on our database
> resulting in read timeouts. So for these jobs I'd like to set a much
> smaller limit of concurrently running map tasks.


IIRC, 0.21 plus the capacity scheduler has some capabilities that might be useful here. You can set a (global) default heap size per task. For those jobs that you want to limit, you can double the requested size, and the capacity scheduler will schedule only one of those tasks where it would otherwise schedule two.
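
A sketch of that trick, assuming the cluster runs the capacity
scheduler with memory-based scheduling enabled (the property names
below are from the 0.20/0.21-era capacity scheduler, and DbHeavyJob
plus the elided job setup are placeholders):

    import java.io.IOException;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    public class DbHeavyJob {
      public static void main(String[] args) throws IOException {
        // Suppose the cluster defines one map slot as 1024 MB:
        //   mapred.cluster.map.memory.mb = 1024
        // Requesting twice that makes each map task of this job occupy
        // two slots, halving how many run concurrently per node.
        JobConf conf = new JobConf(DbHeavyJob.class);
        conf.setLong("mapred.job.map.memory.mb", 2048);
        // ... mapper, input/output setup elided ...
        JobClient.runJob(conf);
      }
    }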