Posted to common-user@hadoop.apache.org by Scott <sk...@weather.com> on 2009/06/10 18:40:18 UTC

Hadoop streaming - No room for reduce task error

Complete newbie map/reduce question here.  I am using Hadoop streaming as 
I come from a Perl background, and am trying to prototype/test a process 
to load/clean-up ad server log lines from multiple input files into one 
large file on HDFS that can then be used as the source of a Hive table.

I have a Perl map script that reads an input line from stdin, does the 
needed cleanup/manipulation, and writes back to stdout.  I don't really 
need a reduce step, as I don't care what order the lines are written in, 
and there is no summary data to produce.  When I run the job with 
-reducer NONE I get valid output, but I get multiple part-xxxxx files 
rather than one big file.

So I wrote a trivial 'reduce' script that reads from stdin and simply 
splits the key/value, and writes the value back to stdout.
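
Trimmed down, the reducer is essentially just something like this (a rough 
sketch of what the script does, assuming tab-separated key/value pairs, 
which is the streaming default):

#!/usr/bin/perl
# Minimal pass-through reducer: drop the streaming key, keep the value.
use strict;
use warnings;

while (my $line = <STDIN>) {
    chomp $line;
    # Streaming hands the reducer "key<TAB>value" lines; keep the value.
    my ($key, $value) = split /\t/, $line, 2;
    print "$value\n" if defined $value;
}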

I am executing the code as follows:

./hadoop jar ../contrib/streaming/hadoop-0.19.1-streaming.jar -mapper 
"/usr/bin/perl /home/hadoop/scripts/map_parse_log_r2.pl" -reducer 
"/usr/bin/perl /home/hadoop/scripts/reduce_parse_log.pl" -input 
/logs/*.log -output test9

The code I have works when given a small set of input files.  However, I 
get the following error when attempting to run the code on a large set 
of input files:

hadoop-hadoop-jobtracker-testdw0b00.log.2009-06-09:2009-06-09 
15:43:00,905 WARN org.apache.hadoop.mapred.JobInProgress: No room for 
reduce task. Node 
tracker_testdw0b00:localhost.localdomain/127.0.0.1:53245 has 2004049920 
bytes free; but we expect reduce input to take 22138478392

I assume this is because all the map output is being buffered in 
memory prior to running the reduce step?  If so, what can I change to 
stop the buffering?  I just need the map output to go directly to one 
large file.

Thanks,
Scott


Re: Hadoop streaming - No room for reduce task error

Posted by jason hadoop <ja...@gmail.com>.
The reduce input is spilled to disk during the sort: unless the machine/JVM
has a huge allowed memory space, the reduce task needs free space on the
local partition at least as large as its expected input, and the jobtracker
won't assign the task to a node that doesn't have it.
If I did my math correctly, you are trying to push ~22GB through the single
reduce, onto a node with only ~2GB free.

As for the part-xxxxx files: if you have the number of reduces set to zero,
you will get N part files, where N is the number of map tasks.

If you absolutely must have it all go to one reduce, you will need to
increase the free disk space. I think 0.19.1 preserves compression for the
map output through the sort, so you could try enabling map output
compression.

If you have many nodes, you can set the number of reduces to more than one
and then use sort -m on the part files to merge-sort them, assuming your
reduce preserves ordering.
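
Something like this, for example (the paths are just an illustration, and it
assumes the part files have been copied out of HDFS and that each one is
already sorted):

hadoop fs -get test9 /tmp/test9
sort -m /tmp/test9/part-* > /tmp/parsed_logs.txt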

Try adding these parameters to your job line:
-D mapred.compress.map.output=true -D mapred.output.compression.type=BLOCK
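
On your command line that would look roughly like this (the -D options are
generic options, so they go right after the jar, before the
streaming-specific ones):

./hadoop jar ../contrib/streaming/hadoop-0.19.1-streaming.jar \
    -D mapred.compress.map.output=true \
    -D mapred.output.compression.type=BLOCK \
    -mapper "/usr/bin/perl /home/hadoop/scripts/map_parse_log_r2.pl" \
    -reducer "/usr/bin/perl /home/hadoop/scripts/reduce_parse_log.pl" \
    -input /logs/*.log -output test9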

BTW, /bin/cat works fine as an identity mapper or an identity reducer.
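For example, you could pass -reducer /bin/cat on the streaming command line
instead of writing the trivial Perl reducer.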


On Wed, Jun 10, 2009 at 5:31 PM, Todd Lipcon <to...@cloudera.com> wrote:

> Hey Scott,
> It turns out that Alex's answer was mistaken - your error is actually
> coming
> from lack of disk space on the TT that has been assigned the reduce task.
> Specifically, there is not enough space in mapred.local.dir. You'll need to
> change your mapred.local.dir to point to a partition that has enough space
> to contain your reduce output.
>
> As for why this is the case, I hope someone will pipe up. It seems to me
> that reduce output should be able to go directly to the target filesystem
> without using space in mapred.local.dir.
>
> Thanks
> -Todd
>
> [the rest of the quoted thread has been snipped]
>



-- 
Pro Hadoop, a book to guide you from beginner to Hadoop mastery,
http://www.apress.com/book/view/9781430219422
www.prohadoopbook.com a community for Hadoop Professionals

Re: Hadoop streaming - No room for reduce task error

Posted by Todd Lipcon <to...@cloudera.com>.
Hey Scott,
It turns out that Alex's answer was mistaken - your error is actually coming
from lack of disk space on the TT that has been assigned the reduce task.
Specifically, there is not enough space in mapred.local.dir. You'll need to
change your mapred.local.dir to point to a partition that has enough space
to contain your reduce output.
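
For example, something like this in hadoop-site.xml on the tasktrackers (the
paths here are just placeholders for whatever partition actually has the
space):

<property>
  <name>mapred.local.dir</name>
  <value>/data1/mapred/local,/data2/mapred/local</value>
</property>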

As for why this is the case, I hope someone will pipe up. It seems to me
that reduce output should be able to go directly to the target filesystem
without using space in mapred.local.dir.

Thanks
-Todd

On Wed, Jun 10, 2009 at 4:58 PM, Alex Loddengaard <al...@cloudera.com> wrote:

> What is mapred.child.ulimit set to?  This configuration option specifies
> how much memory child processes are allowed to have.  You may want to up
> this limit and see what happens.
>
> Let me know if that doesn't get you anywhere.
>
> Alex
>
> On Wed, Jun 10, 2009 at 9:40 AM, Scott <sk...@weather.com> wrote:
>
> > [original message quoted in full; snipped]
>

Re: Hadoop streaming - No room for reduce task error

Posted by Alex Loddengaard <al...@cloudera.com>.
What is mapred.child.ulimit set to?  This configuration option specifies
how much memory child processes are allowed to have.  You may want to up
this limit and see what happens.
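
For example, to raise it for a single streaming job you could pass something
like this (the value is in kilobytes, and 2097152 here is just an
illustrative 2GB limit, not a recommendation):

-D mapred.child.ulimit=2097152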

Let me know if that doesn't get you anywhere.

Alex

On Wed, Jun 10, 2009 at 9:40 AM, Scott <sk...@weather.com> wrote:

> [original message quoted in full; snipped]
>