Posted to common-user@hadoop.apache.org by RajVish <ra...@yahoo.com> on 2010/07/20 21:11:39 UTC

Separate Server Sets for Map and Reduce

We have lots of servers but a limited storage pool. My map jobs handle
lots of small input files (approx 300 MB compressed), but the reduce input is
huge (about 100 GB), requiring lots of temporary and local storage. I would
like to divide my server pool into two kinds: one set with small disks
(for the map jobs) and a few with big storage (for the combine and reduce jobs).

Is there something I can do that lets me force the reduce jobs to run on
specific nodes?

I have searched Google and a few forums but have not found anything.

-best regards,

Raj 
-- 
View this message in context: http://old.nabble.com/Seperate-Server-Sets-for-Map-and-Reduce-tp29216327p29216327.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.


Re: Separate Server Sets for Map and Reduce

Posted by Alex Kozlov <al...@cloudera.com>.
Hi RajVish,

I am just wondering why the reduce input is huge: would increasing the # of
reducers make it smaller, or is it a 'fixed cost'?  Having the reducer input >>
mapper input definitely makes it a very hard problem to schedule on a
homogeneous cluster, and it may also make the job not scalable.

Regarding your question, you can certainly force the mapper/reducer slot ratio
to be different on different nodes using
mapred.tasktracker.{map,reduce}.tasks.maximum, but this will have
implications for data locality and scalability.
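
As a rough sketch (the slot counts below are made-up illustrative numbers, not
recommendations -- tune them for your hardware), the per-node mapred-site.xml on
the small-disk "map" nodes could advertise only map slots:

  <!-- mapred-site.xml on small-disk nodes: map slots only (illustrative values) -->
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>8</value>
  </property>
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>0</value>
  </property>

and the big-disk nodes the reverse:

  <!-- mapred-site.xml on big-disk nodes: reduce slots only (illustrative values) -->
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>0</value>
  </property>
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>4</value>
  </property>

Note that these are TaskTracker settings, so each TaskTracker has to be
restarted to pick them up.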

Also, you may still end up with the same problem since the mappers cache
their output on a local disk and mapper output == reducer input.
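
If local disk on the map nodes does become the bottleneck, one knob that may
help (I am assuming here that your map nodes have more than one spindle) is
pointing the intermediate-data directories at all the local disks you do have
via mapred.local.dir, e.g. in mapred-site.xml:

  <!-- mapred-site.xml: comma-separated list of local dirs used for
       intermediate (map output / shuffle) data; the paths below are
       placeholders -- substitute your own mount points -->
  <property>
    <name>mapred.local.dir</name>
    <value>/disk1/mapred/local,/disk2/mapred/local</value>
  </property>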

Alex K
