Posted to common-user@hadoop.apache.org by Aayush Garg <aa...@gmail.com> on 2008/12/04 15:42:05 UTC
Optimized way
Hi,
I have a 5-node Hadoop cluster, and all nodes are multi-core.
In the Map function of my program I run a shell command, and this shell
command takes one file as input. Many such files are copied into HDFS.
So, in summary, the map function will run a command like ./run <file1>
<outputfile1>
Could you please suggest the best way to do this, e.g. whether I can use
the multi-core processing of the nodes and run many such maps in parallel?
Thanks,
Aayush
Re: Optimized way
Posted by Alex Loddengaard <al...@cloudera.com>.
Well, Map/Reduce and Hadoop run maps in parallel by definition. I think
you're interested in the following two configuration settings:
mapred.tasktracker.map.tasks.maximum
mapred.tasktracker.reduce.tasks.maximum
These go in hadoop-site.xml and set the maximum number of concurrent map
and reduce tasks for each tasktracker (node). Learn more here:
<http://hadoop.apache.org/core/docs/current/cluster_setup.html#Configuring+the+Hadoop+Daemons>
The sum of map and reduce task slots should be slightly above the number of
cores you have per node. So if you have 8 cores per node, setting the map
task maximum to 6 and the reduce task maximum to 4 would probably be good.
Hope this helps.
Alex
Re: Optimized way
Posted by Amareshwari Sriramadasu <am...@yahoo-inc.com>.
Hi Aayush,
Do you want one map to run one command? You can give an input file
consisting of lines of <file> <outputfile> and use NLineInputFormat, which
splits every N lines of input into one split, i.e. gives N lines to one map
for processing. By default, N is one. Your map can then run the shell
command on its input line. Would this meet your need?
More details @
http://hadoop.apache.org/core/docs/r0.19.0/api/org/apache/hadoop/mapred/lib/NLineInputFormat.html
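To illustrate, here is a minimal sketch of the per-line logic such a map
could use. The class and method names below are hypothetical, not part of
Hadoop; in a real job you would set NLineInputFormat as the job's input
format and call this logic from your map method with each line it receives:

```java
public class Main {
    // Parse one input line of the form "<file> <outputfile>" into the
    // shell command the map would launch, e.g. ./run file1 outputfile1.
    static String[] buildCommand(String line) {
        String[] parts = line.trim().split("\\s+");
        return new String[] { "./run", parts[0], parts[1] };
    }

    public static void main(String[] args) throws Exception {
        String[] cmd = buildCommand("file1 outputfile1");
        System.out.println(String.join(" ", cmd));
        // Inside the real map(), you would then launch it and wait:
        // int exit = new ProcessBuilder(cmd).inheritIO().start().waitFor();
    }
}
```

With N=1 (the default), each map task handles exactly one line, so each
shell invocation runs as its own map in parallel across the cluster.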
Thanks,
Amareshwari