Posted to common-user@hadoop.apache.org by Zeev Milin <ze...@gmail.com> on 2009/08/06 00:59:33 UTC

Maps running - how to increase?

I have a map/reduce job with a total of 6000 map tasks. The issue is
that the number of maps "running" at any given time is 6 (the number of
nodes) and the rest are pending. Does anyone know how to force the cluster
to run more maps in parallel to increase throughput? This is the only job
running on this cluster.

Cluster summary:  0.19.2, 6 nodes, Map tasks capacity: 192, Avg tasks/Node:
64

Thanks,
Zeev

Re: Maps running - how to increase?

Posted by Aaron Kimball <aa...@cloudera.com>.
I don't know that your load-in speed is going to dramatically
increase. There are a number of parameters that adjust aspects of
MapReduce, but HDFS more or less works out of the box. You should run
some monitoring on your nodes (Ganglia, Nagios) or check what
they're doing with top, iotop and iftop to see where you're
experiencing bottlenecks.
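
As a rough sanity check on the numbers below: 1GB in 1.5 minutes is about
11MB/s from the client, and with the default HDFS replication factor of 3
each block is written to three datanodes, so the cluster as a whole is
absorbing roughly 33MB/s. Something like this on each node will show where
the time goes (standard Linux tools; eth0 is an assumption, substitute your
interface):

  top            # per-process CPU and memory
  iotop -o       # only processes currently doing disk I/O
  iftop -i eth0  # per-connection network traffic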
- Aaron

On Thu, Aug 6, 2009 at 11:41 AM, Zeev Milin <ze...@gmail.com> wrote:
> Thanks Aaron,
>
> I changed the settings in the hadoop-site.xml file on all the machines. BTW,
> some settings are only reflected at the job level when I change the
> hadoop-default file; I'm not sure why hadoop-site is being ignored (e.g.
> mapred.tasktracker.map.tasks.maximum).
>
> The files I am trying to load are fairly small (~4MB on average). The
> configuration of each machine is: 2 dual-core Xeons (2.33GHz), 8GB RAM and
> a local SCSI hard drive (6 nodes in total).
>
> I will look into the article you mentioned. I understand that loading the
> files is going to be slow; I was just wondering why the machines are mostly
> idle and under-utilized when more maps could run in parallel. Maps running
> is always 6.
>
> Another option is to load one 20GB file, but currently the speed is fairly
> slow in my opinion: 1GB in 1.5 minutes. What kind of tuning can be done to
> speed up the load into HDFS? If you have any recommendations for specific
> parameters that might help, that would be great.
>
> Thanks,
> Zeev
>

Re: Maps running - how to increase?

Posted by Zeev Milin <ze...@gmail.com>.
Thanks Aaron,

I changed the settings in the hadoop-site.xml file on all the machines. BTW,
some settings are only reflected at the job level when I change the
hadoop-default file; I'm not sure why hadoop-site is being ignored (e.g.
mapred.tasktracker.map.tasks.maximum).

The files I am trying to load are fairly small (~4MB on average). The
configuration of each machine is: 2 dual-core Xeons (2.33GHz), 8GB RAM and
a local SCSI hard drive (6 nodes in total).

I will look into the article you mentioned. I understand that loading the
files is going to be slow; I was just wondering why the machines are mostly
idle and under-utilized when more maps could run in parallel. Maps running
is always 6.

Another option is to load one 20GB file, but currently the speed is fairly
slow in my opinion: 1GB in 1.5 minutes. What kind of tuning can be done to
speed up the load into HDFS? If you have any recommendations for specific
parameters that might help, that would be great.

Thanks,
Zeev

Re: Maps running - how to increase?

Posted by Aaron Kimball <aa...@cloudera.com>.
Is that setting in the hadoop-site.xml file on every node? Each tasktracker
reads that file once at startup and sets its maximum map tasks from it.
There's no way to control this setting on a per-job basis or from the client
(submitting) machine. If you changed hadoop-site.xml after starting the
tasktrackers, you need to restart the tasktracker daemon on each node.
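
For example, something along these lines on each node (a sketch assuming the
stock 0.19 scripts and that HADOOP_HOME points at your install):

  $HADOOP_HOME/bin/hadoop-daemon.sh stop tasktracker
  $HADOOP_HOME/bin/hadoop-daemon.sh start tasktracker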

Note that 32 maps/node is considered a *lot*. This will likely not provide
you with optimal throughput, since they'll be competing for cores, RAM, I/O,
etc. ...Unless you've got some really super-charged machines in your
datacenter :grin:

Also, in terms of optimizing your job -- do you really have 6,000 big files
worth reading? Or are you running a job over 6,000 small files (where small
means less than 100 MB or so)? If the latter, consider using
MultiFileInputFormat to allow each task to operate on multiple files. See
http://www.cloudera.com/blog/2009/02/02/the-small-files-problem/ for some
more detail. Even after all 6,000 map tasks run, you'll have to deal with
reassembling 6,000 intermediate data shards into 6 or 12 reduce tasks. This
will also be slow, unless you bunch up multiple files into a single task.
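
For what it's worth, a minimal sketch of the job setup (untested;
MultiFileInputFormat in the mapred API is abstract, so MyMultiFileInputFormat
below is a hypothetical subclass you would write yourself to return a
RecordReader over each MultiFileSplit):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class SmallFilesJob {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(SmallFilesJob.class);
    conf.setJobName("small-files");

    // Bundle many small files into each map task instead of one file per
    // task. MyMultiFileInputFormat stands in for your own subclass of
    // MultiFileInputFormat.
    conf.setInputFormat(MyMultiFileInputFormat.class);
    conf.setNumMapTasks(100); // a hint only; the splits decide the real count

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf); // identity map/reduce by default
  }
}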

Cheers,
- Aaron


On Wed, Aug 5, 2009 at 5:06 PM, Zeev Milin <ze...@gmail.com> wrote:

> I now see that mapred.tasktracker.map.tasks.maximum=32 at the job level,
> and still only 6 maps are running with 5000+ pending...
>
> Not sure how to force the cluster to run more maps.
>

Re: Maps running - how to increase?

Posted by Zeev Milin <ze...@gmail.com>.
I now see that mapred.tasktracker.map.tasks.maximum=32 at the job level,
and still only 6 maps are running with 5000+ pending...

Not sure how to force the cluster to run more maps.

Re: Maps running - how to increase?

Posted by Zeev Milin <ze...@gmail.com>.
This is the setting in the hadoop-site.xml file:

<property>
 <name>mapred.tasktracker.map.tasks.maximum</name>
 <value>32</value>
</property>

When I look at the job configuration file (XML), I see that this parameter
is set to 2. I'm not sure why the hadoop-site value is not being used.

Re: Maps running - how to increase?

Posted by Tim Sell <tr...@gmail.com>.
have you set:
mapred.tasktracker.map.tasks.maximum
?

That specifies the number of maps that can run on a single node at a time.

2009/8/5 Zeev Milin <ze...@gmail.com>:
> I have a map/reduce job with a total of 6000 map tasks. The issue is
> that the number of maps "running" at any given time is 6 (the number of
> nodes) and the rest are pending. Does anyone know how to force the cluster
> to run more maps in parallel to increase throughput? This is the only job
> running on this cluster.
>
> Cluster summary:  0.19.2, 6 nodes, Map tasks capacity: 192, Avg tasks/Node:
> 64
>
> Thanks,
> Zeev
>