Posted to common-user@hadoop.apache.org by Aayush Garg <aa...@gmail.com> on 2008/12/04 15:42:05 UTC

Optimized way

Hi,

I have a 5-node Hadoop cluster, and all of the nodes are multi-core.
I am running a shell command in the map function of my program; this shell
command takes one file as input. Many such files are stored in HDFS.

So, in summary, each map function will run a command like ./run <file1>
<outputfile1>

Could you please suggest an optimized way to do this, e.g., how I can use
the multiple cores on each node and run many such maps in parallel?

Thanks,
Aayush

Re: Optimized way

Posted by Alex Loddengaard <al...@cloudera.com>.
Well, Map/Reduce and Hadoop by definition run maps in parallel.  I think
you're interested in the following two configuration settings:

mapred.tasktracker.map.tasks.maximum
mapred.tasktracker.reduce.tasks.maximum

These go in hadoop-site.xml and set the maximum number of map and reduce
tasks that run concurrently on each tasktracker (node).  Learn more here:

<http://hadoop.apache.org/core/docs/current/cluster_setup.html#Configuring+the+Hadoop+Daemons>

The sum of the map and reduce maximums should be slightly above the number
of cores you have per node.  So if you have 8 cores per node, setting the
map maximum to 6 and the reduce maximum to 4 would probably work well.
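
For example, the corresponding entries in hadoop-site.xml for that 8-core
case might look like the following (a sketch; tune the values for your own
workload):

<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>6</value>
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>4</value>
</property>

Each tasktracker reads these settings at startup, so restart the
tasktrackers after changing them.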

Hope this helps.

Alex

On Thu, Dec 4, 2008 at 6:42 AM, Aayush Garg <aa...@gmail.com> wrote:

> Hi,
>
> I have a 5-node Hadoop cluster, and all of the nodes are multi-core.
> I am running a shell command in the map function of my program; this shell
> command takes one file as input. Many such files are stored in HDFS.
>
> So, in summary, each map function will run a command like ./run <file1>
> <outputfile1>
>
> Could you please suggest an optimized way to do this, e.g., how I can use
> the multiple cores on each node and run many such maps in parallel?
>
> Thanks,
> Aayush
>

Re: Optimized way

Posted by Amareshwari Sriramadasu <am...@yahoo-inc.com>.
Hi Aayush,
Do you want one map to run one command? You can provide an input file
consisting of lines of the form <file> <outputfile>. Use NLineInputFormat,
which splits N lines of input into one split, i.e., it gives N lines to one
map for processing. By default, N is one. Then your map can simply run the
shell command on its input line. Would this meet your need?
More details @
http://hadoop.apache.org/core/docs/r0.19.0/api/org/apache/hadoop/mapred/lib/NLineInputFormat.html
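
As a rough, untested sketch against the 0.19 mapred API (the class name and
the way arguments are handled are made up for illustration), the whole job
could look like this:

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.mapred.lib.NLineInputFormat;

public class ShellCommandJob {

  public static class ShellMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, Text> {

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, Text> output, Reporter reporter)
        throws IOException {
      // Each record is one input line of the form "<file> <outputfile>".
      String[] args = value.toString().trim().split("\\s+");
      // Run the external command on this pair of files.
      Process p = Runtime.getRuntime().exec(
          new String[] { "./run", args[0], args[1] });
      try {
        int exitCode = p.waitFor();
        // Emit the line and the command's exit code for bookkeeping.
        output.collect(value, new Text(Integer.toString(exitCode)));
      } catch (InterruptedException e) {
        throw new IOException(e.toString());
      }
    }
  }

  public static void main(String[] argv) throws IOException {
    JobConf conf = new JobConf(ShellCommandJob.class);
    conf.setJobName("run-shell-command");

    // One input line per map task (N = 1, which is also the default).
    conf.setInputFormat(NLineInputFormat.class);
    conf.setInt("mapred.line.input.format.linespermap", 1);

    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(Text.class);
    conf.setNumReduceTasks(0); // map-only job

    FileInputFormat.setInputPaths(conf, new Path(argv[0]));
    FileOutputFormat.setOutputPath(conf, new Path(argv[1]));

    JobClient.runJob(conf);
  }
}

Note that the ./run binary must be available on every node (e.g. shipped
via the DistributedCache), since the map tasks execute it locally.
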
Thanks,
Amareshwari
Aayush Garg wrote:
> Hi,
>
> I have a 5-node Hadoop cluster, and all of the nodes are multi-core.
> I am running a shell command in the map function of my program; this shell
> command takes one file as input. Many such files are stored in HDFS.
>
> So, in summary, each map function will run a command like ./run <file1>
> <outputfile1>
>
> Could you please suggest an optimized way to do this, e.g., how I can use
> the multiple cores on each node and run many such maps in parallel?
>
> Thanks,
> Aayush
>
>