Posted to common-user@hadoop.apache.org by Jacob R Rideout <ap...@jacobrideout.net> on 2010/03/06 17:31:28 UTC

Shuffle In Memory OutOfMemoryError

Hi all,

We are seeing the following error in our reducers of a particular job:

Error: java.lang.OutOfMemoryError: Java heap space
        at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.shuffleInMemory(ReduceTask.java:1508)
        at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1408)
        at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1261)
        at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1195)


After enough reducers fail, the entire job fails. This error occurs
regardless of whether mapred.compress.map.output is true. We were able
to avoid the issue by reducing mapred.job.shuffle.input.buffer.percent
to 20%. Shouldn't the framework, via ShuffleRamManager.canFitInMemory
and ShuffleRamManager.reserve, correctly detect the memory available
for allocation? I would think that with poor configuration settings
(and the default settings in particular) the job might be less
efficient, but it shouldn't die.
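
For reference, this is roughly how we set the workaround (a sketch using the
old org.apache.hadoop.mapred.JobConf API; the 20% value is what worked for us,
not a general recommendation):

    JobConf conf = new JobConf();
    // Shrink the fraction of the reduce heap used to buffer map outputs
    // in memory during the shuffle (the default is 0.70).
    conf.setFloat("mapred.job.shuffle.input.buffer.percent", 0.20f);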

Here is some more context from the logs; I have attached the full
reducer log here: http://gist.github.com/323746


2010-03-06 07:54:49,621 INFO org.apache.hadoop.mapred.ReduceTask:
Shuffling 4191933 bytes (435311 raw bytes) into RAM from
attempt_201003060739_0002_m_000061_0
2010-03-06 07:54:50,222 INFO org.apache.hadoop.mapred.ReduceTask: Task
attempt_201003060739_0002_r_000000_0: Failed fetch #1 from
attempt_201003060739_0002_m_000202_0
2010-03-06 07:54:50,223 WARN org.apache.hadoop.mapred.ReduceTask:
attempt_201003060739_0002_r_000000_0 adding host
hd37.dfs.returnpath.net to penalty box, next contact in 4 seconds
2010-03-06 07:54:50,223 INFO org.apache.hadoop.mapred.ReduceTask:
attempt_201003060739_0002_r_000000_0: Got 1 map-outputs from previous
failures
2010-03-06 07:54:50,223 FATAL org.apache.hadoop.mapred.TaskRunner:
attempt_201003060739_0002_r_000000_0 : Map output copy failure :
java.lang.OutOfMemoryError: Java heap space
        at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.shuffleInMemory(ReduceTask.java:1508)
        at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1408)
        at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1261)
        at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1195)


We tried this both in 0.20.1 and 0.20.2. We had hoped MAPREDUCE-1182
would address the issue in 0.20.2, but it did not. Does anyone have
any comments or suggestions? Is this a bug I should file a JIRA for?

Jacob Rideout
Return Path


RE: Shuffle In Memory OutOfMemoryError

Posted by Andy Sautins <an...@returnpath.net>.
   Ah.  My mistake.  We will apply the patch manually to 0.20.2 and re-run.  Just out of curiosity, why do the release notes for 0.20.2 indicate that MAPREDUCE-1182 is included in the release, but the patch needs to be applied manually?  Is there an additional part of the patch not included in the release?

   Thanks for your help.

   Andy

-----Original Message-----
From: Ted Yu [mailto:yuzhihong@gmail.com] 
Sent: Tuesday, March 09, 2010 3:33 PM
To: common-user@hadoop.apache.org
Subject: Re: Shuffle In Memory OutOfMemoryError

Andy:
You need to manually apply the patch.

Cheers

On Tue, Mar 9, 2010 at 2:23 PM, Andy Sautins <an...@returnpath.net> wrote:

>
>   Thanks Ted.  My understanding is that MAPREDUCE-1182 is included in the
> 0.20.2 release.  We upgraded our cluster to 0.20.2 this weekend and re-ran
> the same job scenarios.  Running with mapred.reduce.parallel.copies set to 1,
> we continue to have the same Java heap space error.
>
>
>
> -----Original Message-----
> From: Ted Yu [mailto:yuzhihong@gmail.com]
> Sent: Tuesday, March 09, 2010 12:56 PM
> To: common-user@hadoop.apache.org
> Subject: Re: Shuffle In Memory OutOfMemoryError
>
> This issue has been resolved in
> http://issues.apache.org/jira/browse/MAPREDUCE-1182
>
> Please apply the patch
> M1182-1v20.patch<
> http://issues.apache.org/jira/secure/attachment/12424116/M1182-1v20.patch>
>
> On Sun, Mar 7, 2010 at 3:57 PM, Andy Sautins <andy.sautins@returnpath.net
> >wrote:
>
> >
> >  Thanks Ted.  Very helpful.  You are correct that I misunderstood the
> code
> > at ReduceTask.java:1535.  I missed the fact that it's in an IOException
> catch
> > block.  My mistake.  That's what I get for being in a rush.
> >
> >  For what it's worth I did re-run the job with
> > mapred.reduce.parallel.copies set with values from 5 all the way down to
> 1.
> >  All failed with the same error:
> >
> > Error: java.lang.OutOfMemoryError: Java heap space
> >        at
> >
> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.shuffleInMemory(ReduceTask.java:1508)
> >        at
> >
> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1408)
> >        at
> >
> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1261)
> >        at
> >
> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1195)
> >
> >
> >   So from that it does seem like something else might be going on, yes?
>  I
> > need to do some more research.
> >
> >  I appreciate your insights.
> >
> >  Andy
> >
> > -----Original Message-----
> > From: Ted Yu [mailto:yuzhihong@gmail.com]
> > Sent: Sunday, March 07, 2010 3:38 PM
> > To: common-user@hadoop.apache.org
> > Subject: Re: Shuffle In Memory OutOfMemoryError
> >
> > My observation is based on this call chain:
> > MapOutputCopier.run() calling copyOutput() calling getMapOutput() calling
> > ramManager.canFitInMemory(decompressedLength)
> >
> > Basically ramManager.canFitInMemory() makes its decision without considering
> > the
> > number of MapOutputCopiers that are running. Thus 1.25 * 0.7 of total
> heap
> > may be used in shuffling if default parameters were used.
> > Of course, you should check the value for mapred.reduce.parallel.copies
> to
> > see if it is 5. If it is 4 or lower, my reasoning wouldn't apply.
> >
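
(Worked through with the defaults Ted refers to: maxSize is 0.70 of the reduce
heap via mapred.job.shuffle.input.buffer.percent, each single shuffle is capped
at 0.25 * maxSize by MAX_SINGLE_SHUFFLE_SEGMENT_FRACTION, and there are 5
parallel copiers, so the in-memory shuffle can be asked to hold up to
5 * 0.25 * maxSize = 1.25 * maxSize, i.e. 1.25 * 0.70 = 0.875 of the heap.)
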
> > About ramManager.unreserve() call, ReduceTask.java from hadoop 0.20.2
> only
> > has 2731 lines. So I have to guess the location of the code snippet you
> > provided.
> > I found this around line 1535:
> >        } catch (IOException ioe) {
> >          LOG.info("Failed to shuffle from " +
> > mapOutputLoc.getTaskAttemptId(),
> >                   ioe);
> >
> >          // Inform the ram-manager
> >          ramManager.closeInMemoryFile(mapOutputLength);
> >          ramManager.unreserve(mapOutputLength);
> >
> >          // Discard the map-output
> >          try {
> >            mapOutput.discard();
> >          } catch (IOException ignored) {
> >            LOG.info("Failed to discard map-output from " +
> >                     mapOutputLoc.getTaskAttemptId(), ignored);
> >          }
> > Please confirm the line number.
> >
> > If we're looking at the same code, I am afraid I don't see how we can
> > improve it. First, I assume IOException shouldn't happen that often.
> > Second,
> > mapOutput.discard() just sets:
> >          data = null;
> > for in memory case. Even if we call mapOutput.discard() before
> > ramManager.unreserve(), we don't know when GC would kick in and make more
> > memory available.
> > Of course, given the large number of map outputs in your system, it
> became
> > more likely that the root cause from my reasoning made OOME happen
> sooner.
> >
> > Thanks
> >
> > >
> > On Sun, Mar 7, 2010 at 1:03 PM, Andy Sautins <
> andy.sautins@returnpath.net
> > >wrote:
> >
> > >
> > >   Ted,
> > >
> > >   I'm trying to follow the logic in your mail and I'm not sure I'm
> > > following.  If you wouldn't mind helping me understand, I would appreciate
> > it.
> > >
> > >   Looking at the code maxSingleShuffleLimit is only used in determining
> > if
> > > the copy _can_ fit into memory:
> > >
> > >     boolean canFitInMemory(long requestedSize) {
> > >        return (requestedSize < Integer.MAX_VALUE &&
> > >                requestedSize < maxSingleShuffleLimit);
> > >      }
> > >
> > >    It also looks like the RamManager.reserve should wait until memory
> is
> > > available so it should hit a memory limit for that reason.
> > >
> > >    What does seem a little strange to me is the following (
> > ReduceTask.java
> > > starting at 2730 ):
> > >
> > >          // Inform the ram-manager
> > >          ramManager.closeInMemoryFile(mapOutputLength);
> > >          ramManager.unreserve(mapOutputLength);
> > >
> > >          // Discard the map-output
> > >          try {
> > >            mapOutput.discard();
> > >          } catch (IOException ignored) {
> > >            LOG.info("Failed to discard map-output from " +
> > >                     mapOutputLoc.getTaskAttemptId(), ignored);
> > >          }
> > >          mapOutput = null;
> > >
> > >   So to me that looks like the ramManager unreserves the memory before
> > the
> > > mapOutput is discarded.  Shouldn't the mapOutput be discarded _before_
> > the
> > > ramManager unreserves the memory?  If the memory is unreserved before
> the
> > > actual underlying data references are removed then it seems like
> another
> > > thread can try to allocate memory ( ReduceTask.java:2730 ) before the
> > > previous memory is disposed ( mapOutput.discard() ).
> > >
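
In other words, the reordering being suggested would look roughly like this (a
sketch only, not the shipped code; as Ted notes above, the discarded buffer
still has to wait for GC before the memory is actually reclaimed):

          // Discard the map-output first, dropping the reference to the
          // in-memory data...
          try {
            mapOutput.discard();
          } catch (IOException ignored) {
            LOG.info("Failed to discard map-output from " +
                     mapOutputLoc.getTaskAttemptId(), ignored);
          }
          mapOutput = null;

          // ...and only then tell the ram-manager that the space is free
          ramManager.closeInMemoryFile(mapOutputLength);
          ramManager.unreserve(mapOutputLength);
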
> > >   Not sure that makes sense.  One thing to note is that the particular
> > job
> > > that is failing does have a good number ( 200k+ ) of map outputs.  The
> > large
> > > number of small map outputs may be why we are triggering a problem.
> > >
> > >   Thanks again for your thoughts.
> > >
> > >   Andy
> > >
> > >
> > > -----Original Message-----
> > > From: Jacob R Rideout [mailto:apache@jacobrideout.net]
> > > Sent: Sunday, March 07, 2010 1:21 PM
> > > To: common-user@hadoop.apache.org
> > > Cc: Andy Sautins; Ted Yu
> > > Subject: Re: Shuffle In Memory OutOfMemoryError
> > >
> > > Ted,
> > >
> > > Thank you. I filed MAPREDUCE-1571 to cover this issue. I might have
> > > some time to write a patch later this week.
> > >
> > > Jacob Rideout
> > >
> > > On Sat, Mar 6, 2010 at 11:37 PM, Ted Yu <yu...@gmail.com> wrote:
> > > > I think there is a mismatch (in ReduceTask.java) between:
> > > >      this.numCopiers = conf.getInt("mapred.reduce.parallel.copies",
> 5);
> > > > and:
> > > >        maxSingleShuffleLimit = (long)(maxSize *
> > > > MAX_SINGLE_SHUFFLE_SEGMENT_FRACTION);
> > > > where MAX_SINGLE_SHUFFLE_SEGMENT_FRACTION is 0.25f
> > > >
> > > > because
> > > >      copiers = new ArrayList<MapOutputCopier>(numCopiers);
> > > > so the total memory allocated for in-mem shuffle is 1.25 * maxSize
> > > >
> > > > A JIRA should be filed to correlate the constant 5 above and
> > > > MAX_SINGLE_SHUFFLE_SEGMENT_FRACTION.
> > > >
> > > > Cheers
> > > >
> > > > On Sat, Mar 6, 2010 at 8:31 AM, Jacob R Rideout <
> > apache@jacobrideout.net
> > > >wrote:
> > > >
> > > >> Hi all,
> > > >>
> > > >> We are seeing the following error in our reducers of a particular
> job:
> > > >>
> > > >> Error: java.lang.OutOfMemoryError: Java heap space
> > > >>        at
> > > >>
> > >
> >
> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.shuffleInMemory(ReduceTask.java:1508)
> > > >>        at
> > > >>
> > >
> >
> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1408)
> > > >>        at
> > > >>
> > >
> >
> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1261)
> > > >>        at
> > > >>
> > >
> >
> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1195)
> > > >>
> > > >>
> > > >> After enough reducers fail the entire job fails. This error occurs
> > > >> regardless of whether mapred.compress.map.output is true. We were
> able
> > > >> to avoid the issue by reducing
> mapred.job.shuffle.input.buffer.percent
> > > >> to 20%. Shouldn't the framework via ShuffleRamManager.canFitInMemory
> > > >> and ShuffleRamManager.reserve correctly detect the memory
> > > >> available for allocation? I would think that with poor configuration
> > > >> settings (and default settings in particular) the job may not be as
> > > >> efficient, but wouldn't die.
> > > >>
> > > >> Here is some more context in the logs, I have attached the full
> > > >> reducer log here: http://gist.github.com/323746
> > > >>
> > > >>
> > > >> 2010-03-06 07:54:49,621 INFO org.apache.hadoop.mapred.ReduceTask:
> > > >> Shuffling 4191933 bytes (435311 raw bytes) into RAM from
> > > >> attempt_201003060739_0002_m_000061_0
> > > >> 2010-03-06 07:54:50,222 INFO org.apache.hadoop.mapred.ReduceTask:
> Task
> > > >> attempt_201003060739_0002_r_000000_0: Failed fetch #1 from
> > > >> attempt_201003060739_0002_m_000202_0
> > > >> 2010-03-06 07:54:50,223 WARN org.apache.hadoop.mapred.ReduceTask:
> > > >> attempt_201003060739_0002_r_000000_0 adding host
> > > >> hd37.dfs.returnpath.net to penalty box, next contact in 4 seconds
> > > >> 2010-03-06 07:54:50,223 INFO org.apache.hadoop.mapred.ReduceTask:
> > > >> attempt_201003060739_0002_r_000000_0: Got 1 map-outputs from
> previous
> > > >> failures
> > > >> 2010-03-06 07:54:50,223 FATAL org.apache.hadoop.mapred.TaskRunner:
> > > >> attempt_201003060739_0002_r_000000_0 : Map output copy failure :
> > > >> java.lang.OutOfMemoryError: Java heap space
> > > >>        at
> > > >>
> > >
> >
> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.shuffleInMemory(ReduceTask.java:1508)
> > > >>        at
> > > >>
> > >
> >
> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1408)
> > > >>        at
> > > >>
> > >
> >
> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1261)
> > > >>        at
> > > >>
> > >
> >
> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1195)
> > > >>
> > > >>
> > > >> We tried this both in 0.20.1 and 0.20.2. We had hoped MAPREDUCE-1182
> > > >> would address the issue in 0.20.2, but it did not. Does anyone have
> > > >> any comments or suggestions? Is this a bug I should file a JIRA for?
> > > >>
> > > >> Jacob Rideout
> > > >> Return Path
> > > >>
> > > >
> > >
> >
>

Re: Shuffle In Memory OutOfMemoryError

Posted by Alex Kozlov <al...@cloudera.com>.
In our case there were three things that helped:


   - reduce mapred.reduce.parallel.copies (and incur some execution time
   penalty on large clusters)
   - increase the # of reducers (and just have the larger # of output files)
   - decrease the amount of map output (may or may not be possible in your
   case)
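
A sketch of the first two knobs above, using the old JobConf API (the numbers
are placeholders, not recommendations):

    JobConf conf = new JobConf();
    // Fewer parallel fetch threads per reduce -> fewer map outputs held
    // in memory at the same time during the shuffle (the default is 5).
    conf.setInt("mapred.reduce.parallel.copies", 2);
    // More reducers -> fewer (and smaller) map outputs per reduce.
    conf.setNumReduceTasks(64);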


Alex K


On Fri, May 7, 2010 at 8:25 PM, Ted Yu <yu...@gmail.com> wrote:

> You need to lower mapred.job.shuffle.input.buffer.percent to 20% or 25%.
> I didn't have time recently to find the root cause in 0.20.2
>
> I was told that shuffle has been rewritten in 0.21
> You may give it a try.
>
> On Fri, May 7, 2010 at 8:08 PM, Bo Shi <bs...@visiblemeasures.com> wrote:
>
> > Hey Ted, any further insights on this?  We're encountering a similar
> > issue (on CD2).  I'll be applying MAPREDUCE-1182 to see if that
> > resolves our case but it sounds like that JIRA didn't completely
> > eliminate the problem for some folks.
> >
> > On Wed, Mar 10, 2010 at 11:54 PM, Ted Yu <yu...@gmail.com> wrote:
> > > I pressed the send key a bit early.
> > >
> > > I will have to dig a bit deeper.
> > > Hopefully someone can find a reader.close() call, after which I will look
> > for
> > > another possible root cause :-)
> > >
> > >
> > > On Wed, Mar 10, 2010 at 7:48 PM, Ted Yu <yu...@gmail.com> wrote:
> > >
> > >> Thanks to Andy for the log he provided.
> > >>
> > >> You can see from the log below that size increased steadily from
> > 341535057
> > >> to 408181692, approaching maxSize. Then OOME:
> > >>
> > >>
> > >> 2010-03-10 18:38:32,936 INFO org.apache.hadoop.mapred.ReduceTask:
> > reserve:
> > >> pos=start requestedSize=3893000 size=341535057 numPendingRequests=0
> > >> maxSize=417601952
> > >> 2010-03-10 18:38:32,936 INFO org.apache.hadoop.mapred.ReduceTask:
> > reserve:
> > >> pos=end requestedSize=3893000 size=345428057 numPendingRequests=0
> > >> maxSize=417601952
> > >> ...
> > >> 2010-03-10 18:38:35,950 INFO org.apache.hadoop.mapred.ReduceTask:
> > reserve:
> > >> pos=end requestedSize=635753 size=408181692 numPendingRequests=0
> > >> maxSize=417601952
> > >> 2010-03-10 18:38:36,603 INFO org.apache.hadoop.mapred.ReduceTask: Task
> > >> attempt_201003101826_0001_r_000004_0: Failed fetch #1 from
> > >> attempt_201003101826_0001_m_000875_0
> > >>
> > >> 2010-03-10 18:38:36,603 WARN org.apache.hadoop.mapred.ReduceTask:
> > >> attempt_201003101826_0001_r_000004_0 adding host
> > hd17.dfs.returnpath.net to penalty box, next contact in 4 seconds
> > >> 2010-03-10 18:38:36,604 INFO org.apache.hadoop.mapred.ReduceTask:
> > >> attempt_201003101826_0001_r_000004_0: Got 1 map-outputs from previous
> > >> failures
> > >> 2010-03-10 18:38:36,605 FATAL org.apache.hadoop.mapred.TaskRunner:
> > >> attempt_201003101826_0001_r_000004_0 : Map output copy failure :
> > >> java.lang.OutOfMemoryError: Java heap space
> > >>         at
> > >>
> >
> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.shuffleInMemory(ReduceTask.java:1513)
> > >>         at
> > >>
> >
> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1413)
> > >>         at
> > >>
> >
> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1266)
> > >>         at
> > >>
> >
> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1200)
> > >>
> > >> Looking at the call to unreserve() in ReduceTask, two were for
> > IOException
> > >> and the other was for Sanity check (line 1557). Meaning they wouldn't
> be
> > >> called in normal execution path.
> > >>
> > >> I see one call in IFile.InMemoryReader close() method:
> > >>       // Inform the RamManager
> > >>       ramManager.unreserve(bufferSize);
> > >>
> > >> And InMemoryReader is used in createInMemorySegments():
> > >>           Reader<K, V> reader =
> > >>             new InMemoryReader<K, V>(ramManager, mo.mapAttemptId,
> > >>                                      mo.data, 0, mo.data.length);
> > >>
> > >> But I don't see reader.close() in ReduceTask file.
> > >>
> > >
> > >
> > >> On Wed, Mar 10, 2010 at 3:34 PM, Chris Douglas <chrisdo@yahoo-inc.com
> > >wrote:
> > >>
> > >>> I don't think this OOM is a framework bug per se, and given the
> > >>> rewrite/refactoring of the shuffle in MAPREDUCE-318 (in 0.21), tuning
> > the
> > >>> 0.20 shuffle semantics is likely not worthwhile (though data
> informing
> > >>> improvements to trunk would be excellent). Most likely (and
> > tautologically),
> > >>> ReduceTask simply requires more memory than is available and the job
> > failure
> > >>> can be avoided by either 0) increasing the heap size or 1) lowering
> > >>> mapred.shuffle.input.buffer.percent. Most of the tasks we run have a
> > heap of
> > >>> 1GB. For a reduce fetching >200k map outputs, that's a reasonable,
> even
> > >>> stingy amount of space. -C
> > >>>
> > >>>
> > >>> On Mar 10, 2010, at 5:26 AM, Ted Yu wrote:
> > >>>
> > >>>  I verified that size and maxSize are long. This means MR-1182 didn't
> > >>>> resolve
> > >>>> Andy's issue.
> > >>>>
> > >>>> According to Andy:
> > >>>> At the beginning of the job there are 209,754 pending map tasks and
> 32
> > >>>> pending reduce tasks
> > >>>>
> > >>>> My guess is that GC wasn't reclaiming memory fast enough, leading to
> > OOME
> > >>>> because of large number of in-memory shuffle candidates.
> > >>>>
> > >>>> My suggestion for Andy would be to:
> > >>>> 1. add -verbose:gc as JVM parameter
> > >>>> 2. modify reserve() slightly to calculate the maximum outstanding
> > >>>> numPendingRequests and print the maximum.
> > >>>>
> > >>>> Based on the output from above two items, we can discuss solution.
> > >>>> My intuition is to place an upper bound on numPendingRequests beyond
> which
> > >>>> canFitInMemory() returns false.
> > >>>>
> > >>>> My two cents.
> > >>>>
> > >>>> On Tue, Mar 9, 2010 at 11:51 PM, Christopher Douglas
> > >>>> <ch...@yahoo-inc.com>wrote:
> > >>>>
> > >>>>  That section of code is unmodified in MR-1182. See the patches/svn
> > log.
> > >>>>> -C
> > >>>>>
> > >>>>> Sent from my iPhone
> > >>>>>
> > >>>>>
> > >>>>> On Mar 9, 2010, at 7:44 PM, "Ted Yu" <yu...@gmail.com> wrote:
> > >>>>>
> > >>>>> I just downloaded hadoop-0.20.2 tar ball from cloudera mirror.
> > >>>>>
> > >>>>>> This is what I see in ReduceTask (line 999):
> > >>>>>>   public synchronized boolean reserve(int requestedSize,
> InputStream
> > >>>>>> in)
> > >>>>>>
> > >>>>>>   throws InterruptedException {
> > >>>>>>     // Wait till the request can be fulfilled...
> > >>>>>>     while ((size + requestedSize) > maxSize) {
> > >>>>>>
> > >>>>>> I don't see the fix from MR-1182.
> > >>>>>>
> > >>>>>> That's why I suggested to Andy that he manually apply MR-1182.
> > >>>>>>
> > >>>>>> Cheers
> > >>>>>>
> > >>>>>> On Tue, Mar 9, 2010 at 5:01 PM, Andy Sautins <
> > >>>>>> andy.sautins@returnpath.net
> > >>>>>>
> > >>>>>>> wrote:
> > >>>>>>>
> > >>>>>>
> > >>>>>>
> > >>>>>>  Thanks Christopher.
> > >>>>>>>
> > >>>>>>> The heap size for reduce tasks is configured to be 640M (
> > >>>>>>> mapred.child.java.opts set to -Xmx640m ).
> > >>>>>>>
> > >>>>>>> Andy
> > >>>>>>>
> > >>>>>>> -----Original Message-----
> > >>>>>>> From: Christopher Douglas [mailto:chrisdo@yahoo-inc.com]
> > >>>>>>> Sent: Tuesday, March 09, 2010 5:19 PM
> > >>>>>>> To: common-user@hadoop.apache.org
> > >>>>>>> Subject: Re: Shuffle In Memory OutOfMemoryError
> > >>>>>>>
> > >>>>>>> No, MR-1182 is included in 0.20.2
> > >>>>>>>
> > >>>>>>> What heap size have you set for your reduce tasks? -C
> > >>>>>>>
> > >>>>>>> Sent from my iPhone
> > >>>>>>>
> > >>>>>>> On Mar 9, 2010, at 2:34 PM, "Ted Yu" <yu...@gmail.com>
> wrote:
> > >>>>>>>
> > >>>>>>> Andy:
> > >>>>>>>
> > >>>>>>>> You need to manually apply the patch.
> > >>>>>>>>
> > >>>>>>>> Cheers
> > >>>>>>>>
> > >>>>>>>> On Tue, Mar 9, 2010 at 2:23 PM, Andy Sautins <
> > >>>>>>>>
> > >>>>>>>>  andy.sautins@returnpath.net
> > >>>>>>>
> > >>>>>>>  wrote:
> > >>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>
> > >>>>>>>>  Thanks Ted.  My understanding is that MAPREDUCE-1182 is
> included
> > >>>>>>>>> in the
> > >>>>>>>>> 0.20.2 release.  We upgraded our cluster to 0.20.2 this weekend
> > and
> > >>>>>>>>> re-ran
> > >>>>>>>>> the same job scenarios.  Running with
> > mapred.reduce.parallel.copies
> > >>>>>>>>> set to 1
> > >>>>>>>>> and continue to have the same Java heap space error.
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>> -----Original Message-----
> > >>>>>>>>> From: Ted Yu [mailto:yuzhihong@gmail.com]
> > >>>>>>>>> Sent: Tuesday, March 09, 2010 12:56 PM
> > >>>>>>>>> To: common-user@hadoop.apache.org
> > >>>>>>>>> Subject: Re: Shuffle In Memory OutOfMemoryError
> > >>>>>>>>>
> > >>>>>>>>> This issue has been resolved in
> > >>>>>>>>> http://issues.apache.org/jira/browse/MAPREDUCE-1182
> > >>>>>>>>>
> > >>>>>>>>> Please apply the patch
> > >>>>>>>>> M1182-1v20.patch<
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>
> >
> http://issues.apache.org/jira/secure/attachment/12424116/M1182-1v20.patch
> > >>>>>>>
> > >>>>>>>
> > >>>>>>>>
> > >>>>>>>>>>  On Sun, Mar 7, 2010 at 3:57 PM, Andy Sautins <
> > >>>>>>>>>
> > >>>>>>>>>  andy.sautins@returnpath.net
> > >>>>>>>>
> > >>>>>>>
> > >>>>>>>  wrote:
> > >>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>>  Thanks Ted.  Very helpful.  You are correct that I
> misunderstood
> > >>>>>>>>>> the
> > >>>>>>>>>>
> > >>>>>>>>>>  code
> > >>>>>>>>>
> > >>>>>>>>>  at ReduceTask.java:1535.  I missed the fact that it's in a
> > >>>>>>>>>> IOException
> > >>>>>>>>>>
> > >>>>>>>>>>  catch
> > >>>>>>>>>
> > >>>>>>>>>  block.  My mistake.  That's what I get for being in a rush.
> > >>>>>>>>>>
> > >>>>>>>>>> For what it's worth I did re-run the job with
> > >>>>>>>>>> mapred.reduce.parallel.copies set with values from 5 all the
> way
> > >>>>>>>>>> down to
> > >>>>>>>>>>
> > >>>>>>>>>>  1.
> > >>>>>>>>>
> > >>>>>>>>>  All failed with the same error:
> > >>>>>>>>>>
> > >>>>>>>>>> Error: java.lang.OutOfMemoryError: Java heap space
> > >>>>>>>>>>   at
> > >>>>>>>>>>
> > >>>>>>>>>> org.apache.hadoop.mapred.ReduceTask$ReduceCopier
> > >>>>>>>>>>
> > >>>>>>>>> $MapOutputCopier.shuffleInMemory(ReduceTask.java:1508)
> > >>>>>>>>>
> > >>>>>>>>>    at
> > >>>>>>>>>>
> > >>>>>>>>>> org.apache.hadoop.mapred.ReduceTask$ReduceCopier
> > >>>>>>>>>>
> > >>>>>>>>> $MapOutputCopier.getMapOutput(ReduceTask.java:1408)
> > >>>>>>>>>
> > >>>>>>>>>    at
> > >>>>>>>>>>
> > >>>>>>>>>> org.apache.hadoop.mapred.ReduceTask$ReduceCopier
> > >>>>>>>>>>
> > >>>>>>>>> $MapOutputCopier.copyOutput(ReduceTask.java:1261)
> > >>>>>>>>>
> > >>>>>>>>>    at
> > >>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>>
> > org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run
> > >>>>>>>>>>
> > >>>>>>>>> (ReduceTask.java:1195)
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>> So from that it does seem like something else might be going
> on,
> > >>>>>>>>>> yes?
> > >>>>>>>>>>
> > >>>>>>>>>>  I
> > >>>>>>>>>
> > >>>>>>>>>  need to do some more research.
> > >>>>>>>>>>
> > >>>>>>>>>> I appreciate your insights.
> > >>>>>>>>>>
> > >>>>>>>>>> Andy
> > >>>>>>>>>>
> > >>>>>>>>>> -----Original Message-----
> > >>>>>>>>>> From: Ted Yu [mailto:yuzhihong@gmail.com]
> > >>>>>>>>>> Sent: Sunday, March 07, 2010 3:38 PM
> > >>>>>>>>>> To: common-user@hadoop.apache.org
> > >>>>>>>>>> Subject: Re: Shuffle In Memory OutOfMemoryError
> > >>>>>>>>>>
> > >>>>>>>>>> My observation is based on this call chain:
> > >>>>>>>>>> MapOutputCopier.run() calling copyOutput() calling
> > getMapOutput()
> > >>>>>>>>>> calling
> > >>>>>>>>>> ramManager.canFitInMemory(decompressedLength)
> > >>>>>>>>>>
> > >>>>>>>>>> Basically ramManager.canFitInMemory() makes decision without
> > >>>>>>>>>> considering
> > >>>>>>>>>> the
> > >>>>>>>>>> number of MapOutputCopiers that are running. Thus 1.25 * 0.7
> of
> > >>>>>>>>>> total
> > >>>>>>>>>>
> > >>>>>>>>>>  heap
> > >>>>>>>>>
> > >>>>>>>>>  may be used in shuffling if default parameters were used.
> > >>>>>>>>>> Of course, you should check the value for
> > >>>>>>>>>> mapred.reduce.parallel.copies
> > >>>>>>>>>>
> > >>>>>>>>>>  to
> > >>>>>>>>>
> > >>>>>>>>>  see if it is 5. If it is 4 or lower, my reasoning wouldn't
> > apply.
> > >>>>>>>>>>
> > >>>>>>>>>> About ramManager.unreserve() call, ReduceTask.java from hadoop
> > >>>>>>>>>> 0.20.2
> > >>>>>>>>>>
> > >>>>>>>>>>  only
> > >>>>>>>>>
> > >>>>>>>>>  has 2731 lines. So I have to guess the location of the code
> > >>>>>>>>>> snippet you
> > >>>>>>>>>> provided.
> > >>>>>>>>>> I found this around line 1535:
> > >>>>>>>>>>   } catch (IOException ioe) {
> > >>>>>>>>>>     LOG.info("Failed to shuffle from " +
> > >>>>>>>>>> mapOutputLoc.getTaskAttemptId(),
> > >>>>>>>>>>              ioe);
> > >>>>>>>>>>
> > >>>>>>>>>>     // Inform the ram-manager
> > >>>>>>>>>>     ramManager.closeInMemoryFile(mapOutputLength);
> > >>>>>>>>>>     ramManager.unreserve(mapOutputLength);
> > >>>>>>>>>>
> > >>>>>>>>>>     // Discard the map-output
> > >>>>>>>>>>     try {
> > >>>>>>>>>>       mapOutput.discard();
> > >>>>>>>>>>     } catch (IOException ignored) {
> > >>>>>>>>>>       LOG.info("Failed to discard map-output from " +
> > >>>>>>>>>>                mapOutputLoc.getTaskAttemptId(), ignored);
> > >>>>>>>>>>     }
> > >>>>>>>>>> Please confirm the line number.
> > >>>>>>>>>>
> > >>>>>>>>>> If we're looking at the same code, I am afraid I don't see how
> > we
> > >>>>>>>>>> can
> > >>>>>>>>>> improve it. First, I assume IOException shouldn't happen that
> > >>>>>>>>>> often.
> > >>>>>>>>>> Second,
> > >>>>>>>>>> mapOutput.discard() just sets:
> > >>>>>>>>>>     data = null;
> > >>>>>>>>>> for in memory case. Even if we call mapOutput.discard() before
> > >>>>>>>>>> ramManager.unreserve(), we don't know when GC would kick in
> and
> > >>>>>>>>>> make more
> > >>>>>>>>>> memory available.
> > >>>>>>>>>> Of course, given the large number of map outputs in your
> system,
> > it
> > >>>>>>>>>>
> > >>>>>>>>>>  became
> > >>>>>>>>>
> > >>>>>>>>>  more likely that the root cause from my reasoning made OOME
> > happen
> > >>>>>>>>>>
> > >>>>>>>>>>  sooner.
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>>> Thanks
> > >>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>>  On Sun, Mar 7, 2010 at 1:03 PM, Andy Sautins <
> > >>>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>>  andy.sautins@returnpath.net
> > >>>>>>>>>
> > >>>>>>>>>  wrote:
> > >>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>>  Ted,
> > >>>>>>>>>>>
> > >>>>>>>>>>> I'm trying to follow the logic in your mail and I'm not sure
> > I'm
> > >>>>>>>>>>> following.  If you would mind helping me understand I would
> > >>>>>>>>>>> appreciate
> > >>>>>>>>>>>
> > >>>>>>>>>>>  it.
> > >>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>>> Looking at the code maxSingleShuffleLimit is only used in
> > >>>>>>>>>>> determining
> > >>>>>>>>>>>
> > >>>>>>>>>>>  if
> > >>>>>>>>>>
> > >>>>>>>>>>  the copy _can_ fit into memory:
> > >>>>>>>>>>>
> > >>>>>>>>>>> boolean canFitInMemory(long requestedSize) {
> > >>>>>>>>>>>   return (requestedSize < Integer.MAX_VALUE &&
> > >>>>>>>>>>>           requestedSize < maxSingleShuffleLimit);
> > >>>>>>>>>>>  }
> > >>>>>>>>>>>
> > >>>>>>>>>>> It also looks like the RamManager.reserve should wait until
> > >>>>>>>>>>> memory
> > >>>>>>>>>>>
> > >>>>>>>>>>>  is
> > >>>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>>  available so it should hit a memory limit for that reason.
> > >>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>> What does seem a little strange to me is the following (
> > >>>>>>>>>>>
> > >>>>>>>>>>>  ReduceTask.java
> > >>>>>>>>>>
> > >>>>>>>>>>  starting at 2730 ):
> > >>>>>>>>>>>
> > >>>>>>>>>>>     // Inform the ram-manager
> > >>>>>>>>>>>     ramManager.closeInMemoryFile(mapOutputLength);
> > >>>>>>>>>>>     ramManager.unreserve(mapOutputLength);
> > >>>>>>>>>>>
> > >>>>>>>>>>>     // Discard the map-output
> > >>>>>>>>>>>     try {
> > >>>>>>>>>>>       mapOutput.discard();
> > >>>>>>>>>>>     } catch (IOException ignored) {
> > >>>>>>>>>>>       LOG.info("Failed to discard map-output from " +
> > >>>>>>>>>>>                mapOutputLoc.getTaskAttemptId(), ignored);
> > >>>>>>>>>>>     }
> > >>>>>>>>>>>     mapOutput = null;
> > >>>>>>>>>>>
> > >>>>>>>>>>> So to me that looks like the ramManager unreserves the memory
> > >>>>>>>>>>> before
> > >>>>>>>>>>>
> > >>>>>>>>>>>  the
> > >>>>>>>>>>
> > >>>>>>>>>>  mapOutput is discarded.  Shouldn't the mapOutput be discarded
> > >>>>>>>>>>> _before_
> > >>>>>>>>>>>
> > >>>>>>>>>>>  the
> > >>>>>>>>>>
> > >>>>>>>>>>  ramManager unreserves the memory?  If the memory is
> unreserved
> > >>>>>>>>>>> before
> > >>>>>>>>>>>
> > >>>>>>>>>>>  the
> > >>>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>>  actual underlying data references are removed then it seems
> like
> > >>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>>  another
> > >>>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>>  thread can try to allocate memory ( ReduceTask.java:2730 )
> > before
> > >>>>>>>>>>
> > >>>>>>>>>>> the
> > >>>>>>>>>>> previous memory is disposed ( mapOutput.discard() ).
> > >>>>>>>>>>>
> > >>>>>>>>>>> Not sure that makes sense.  One thing to note is that the
> > >>>>>>>>>>> particular
> > >>>>>>>>>>>
> > >>>>>>>>>>>  job
> > >>>>>>>>>>
> > >>>>>>>>>>  that is failing does have a good number ( 200k+ ) of map
> > >>>>>>>>>>> outputs.  The
> > >>>>>>>>>>>
> > >>>>>>>>>>>  large
> > >>>>>>>>>>
> > >>>>>>>>>>  number of small map outputs may be why we are triggering a
> > >>>>>>>>>>> problem.
> > >>>>>>>>>>>
> > >>>>>>>>>>> Thanks again for your thoughts.
> > >>>>>>>>>>>
> > >>>>>>>>>>> Andy
> > >>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>> -----Original Message-----
> > >>>>>>>>>>> From: Jacob R Rideout [mailto:apache@jacobrideout.net]
> > >>>>>>>>>>> Sent: Sunday, March 07, 2010 1:21 PM
> > >>>>>>>>>>> To: common-user@hadoop.apache.org
> > >>>>>>>>>>> Cc: Andy Sautins; Ted Yu
> > >>>>>>>>>>> Subject: Re: Shuffle In Memory OutOfMemoryError
> > >>>>>>>>>>>
> > >>>>>>>>>>> Ted,
> > >>>>>>>>>>>
> > >>>>>>>>>>> Thank you. I filled MAPREDUCE-1571 to cover this issue. I
> might
> > >>>>>>>>>>> have
> > >>>>>>>>>>> some time to write a patch later this week.
> > >>>>>>>>>>>
> > >>>>>>>>>>> Jacob Rideout
> > >>>>>>>>>>>
> > >>>>>>>>>>> On Sat, Mar 6, 2010 at 11:37 PM, Ted Yu <yuzhihong@gmail.com
> >
> > >>>>>>>>>>> wrote:
> > >>>>>>>>>>>
> > >>>>>>>>>>>  I think there is mismatch (in ReduceTask.java) between:
> > >>>>>>>>>>>>  this.numCopiers =
> > conf.getInt("mapred.reduce.parallel.copies",
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>  5);
> > >>>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>  and:
> > >>>>>>>>>>
> > >>>>>>>>>>>   maxSingleShuffleLimit = (long)(maxSize *
> > >>>>>>>>>>>> MAX_SINGLE_SHUFFLE_SEGMENT_FRACTION);
> > >>>>>>>>>>>> where MAX_SINGLE_SHUFFLE_SEGMENT_FRACTION is 0.25f
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> because
> > >>>>>>>>>>>>  copiers = new ArrayList<MapOutputCopier>(numCopiers);
> > >>>>>>>>>>>> so the total memory allocated for in-mem shuffle is 1.25 *
> > >>>>>>>>>>>> maxSize
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> A JIRA should be filed to correlate the constant 5 above and
> > >>>>>>>>>>>> MAX_SINGLE_SHUFFLE_SEGMENT_FRACTION.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Cheers
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> On Sat, Mar 6, 2010 at 8:31 AM, Jacob R Rideout <
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>  apache@jacobrideout.net
> > >>>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>>  wrote:
> > >>>>>>>>>>>
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Hi all,
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> We are seeing the following error in our reducers of a
> > >>>>>>>>>>>>> particular
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>  job:
> > >>>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>>>  Error: java.lang.OutOfMemoryError: Java heap space
> > >>>>>>>>>>>>>   at
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>  org.apache.hadoop.mapred.ReduceTask$ReduceCopier
> > >>>>>>>>>>
> > >>>>>>>>> $MapOutputCopier.shuffleInMemory(ReduceTask.java:1508)
> > >>>>>>>>>
> > >>>>>>>>>    at
> > >>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>  org.apache.hadoop.mapred.ReduceTask$ReduceCopier
> > >>>>>>>>>>
> > >>>>>>>>> $MapOutputCopier.getMapOutput(ReduceTask.java:1408)
> > >>>>>>>>>
> > >>>>>>>>>    at
> > >>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>  org.apache.hadoop.mapred.ReduceTask$ReduceCopier
> > >>>>>>>>>>
> > >>>>>>>>> $MapOutputCopier.copyOutput(ReduceTask.java:1261)
> > >>>>>>>>>
> > >>>>>>>>>    at
> > >>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>
> > org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run
> > >>>>>>>>>>
> > >>>>>>>>> (ReduceTask.java:1195)
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>>>>> After enough reducers fail the entire job fails. This error
> > >>>>>>>>>>>>> occurs
> > >>>>>>>>>>>>> regardless of whether mapred.compress.map.output is true.
> We
> > >>>>>>>>>>>>> were
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>  able
> > >>>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>  to avoid the issue by reducing
> > >>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>>>>  mapred.job.shuffle.input.buffer.percent
> > >>>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>  to 20%. Shouldn't the framework via
> > >>>>>>>>>>
> > >>>>>>>>>>>  ShuffleRamManager.canFitInMemory
> > >>>>>>>>>>>>> and.ShuffleRamManager.reserve correctly detect the the
> memory
> > >>>>>>>>>>>>> available for allocation? I would think that with poor
> > >>>>>>>>>>>>> configuration
> > >>>>>>>>>>>>> settings (and default settings in particular) the job may
> not
> > >>>>>>>>>>>>> be as
> > >>>>>>>>>>>>> efficient, but wouldn't die.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> Here is some more context in the logs, I have attached the
> > full
> > >>>>>>>>>>>>> reducer log here: http://gist.github.com/323746
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> 2010-03-06 07:54:49,621 INFO
> > >>>>>>>>>>>>> org.apache.hadoop.mapred.ReduceTask:
> > >>>>>>>>>>>>> Shuffling 4191933 bytes (435311 raw bytes) into RAM from
> > >>>>>>>>>>>>> attempt_201003060739_0002_m_000061_0
> > >>>>>>>>>>>>> 2010-03-06 07:54:50,222 INFO
> > >>>>>>>>>>>>> org.apache.hadoop.mapred.ReduceTask:
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>  Task
> > >>>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>  attempt_201003060739_0002_r_000000_0: Failed fetch #1 from
> > >>>>>>>>>>
> > >>>>>>>>>>>  attempt_201003060739_0002_m_000202_0
> > >>>>>>>>>>>>> 2010-03-06 07:54:50,223 WARN
> > >>>>>>>>>>>>> org.apache.hadoop.mapred.ReduceTask:
> > >>>>>>>>>>>>> attempt_201003060739_0002_r_000000_0 adding host
> > >>>>>>>>>>>>> hd37.dfs.returnpath.net to penalty box, next contact in 4
> > >>>>>>>>>>>>> seconds
> > >>>>>>>>>>>>> 2010-03-06 07:54:50,223 INFO
> > >>>>>>>>>>>>> org.apache.hadoop.mapred.ReduceTask:
> > >>>>>>>>>>>>> attempt_201003060739_0002_r_000000_0: Got 1 map-outputs
> from
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>  previous
> > >>>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>  failures
> > >>>>>>>>>>
> > >>>>>>>>>>>  2010-03-06 07:54:50,223 FATAL
> > >>>>>>>>>>>>> org.apache.hadoop.mapred.TaskRunner:
> > >>>>>>>>>>>>> attempt_201003060739_0002_r_000000_0 : Map output copy
> > failure :
> > >>>>>>>>>>>>> java.lang.OutOfMemoryError: Java heap space
> > >>>>>>>>>>>>>   at
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>  org.apache.hadoop.mapred.ReduceTask$ReduceCopier
> > >>>>>>>>>>
> > >>>>>>>>> $MapOutputCopier.shuffleInMemory(ReduceTask.java:1508)
> > >>>>>>>>>
> > >>>>>>>>>    at
> > >>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>  org.apache.hadoop.mapred.ReduceTask$ReduceCopier
> > >>>>>>>>>>
> > >>>>>>>>> $MapOutputCopier.getMapOutput(ReduceTask.java:1408)
> > >>>>>>>>>
> > >>>>>>>>>    at
> > >>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>  org.apache.hadoop.mapred.ReduceTask$ReduceCopier
> > >>>>>>>>>>
> > >>>>>>>>> $MapOutputCopier.copyOutput(ReduceTask.java:1261)
> > >>>>>>>>>
> > >>>>>>>>>    at
> > >>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>
> > org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run
> > >>>>>>>>>>
> > >>>>>>>>> (ReduceTask.java:1195)
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>>>>> We tried this both in 0.20.1 and 0.20.2. We had hoped
> > >>>>>>>>>>>>> MAPREDUCE-1182
> > >>>>>>>>>>>>> would address the issue in 0.20.2, but it did not. Does
> > anyone
> > >>>>>>>>>>>>> have
> > >>>>>>>>>>>>> any comments or suggestions? Is this a bug I should file a
> > JIRA
> > >>>>>>>>>>>>> for?
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> Jacob Rideout
> > >>>>>>>>>>>>> Return Path
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>
> > >>>
> > >>
> > >
> >
>

Re: Shuffle In Memory OutOfMemoryError

Posted by Ted Yu <yu...@gmail.com>.
You need to lower mapred.job.shuffle.input.buffer.percent to 20% or 25%.
I didn't have time recently to find the root cause in 0.20.2.

I was told that the shuffle has been rewritten in 0.21.
You may give it a try.
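
Concretely, the two knobs discussed in this thread look roughly like this with
the old JobConf API (a sketch; the heap value is just an example, in line with
Chris's 1GB suggestion quoted below):

    JobConf conf = new JobConf();
    // Give each reduce task a larger heap...
    conf.set("mapred.child.java.opts", "-Xmx1024m");
    // ...and/or shrink the in-memory shuffle buffer fraction (default
    // 0.70) so fetched map outputs spill to disk sooner.
    conf.setFloat("mapred.job.shuffle.input.buffer.percent", 0.25f);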

On Fri, May 7, 2010 at 8:08 PM, Bo Shi <bs...@visiblemeasures.com> wrote:

> Hey Ted, any further insights on this?  We're encountering a similar
> issue (on CD2).  I'll be applying MAPREDUCE-1182 to see if that
> resolves our case but it sounds like that JIRA didn't completely
> eliminate the problem for some folks.
>
> On Wed, Mar 10, 2010 at 11:54 PM, Ted Yu <yu...@gmail.com> wrote:
> > I pressed send key a bit early.
> >
> > I will have to dig a bit deeper.
> > Hopefully someone can find reader.close() call after which I will look
> for
> > another possible root cause :-)
> >
> >
> > On Wed, Mar 10, 2010 at 7:48 PM, Ted Yu <yu...@gmail.com> wrote:
> >
> >> Thanks to Andy for the log he provided.
> >>
> >> You can see from the log below that size increased steadily from
> 341535057
> >> to 408181692, approaching maxSize. Then OOME:
> >>
> >>
> >> 2010-03-10 18:38:32,936 INFO org.apache.hadoop.mapred.ReduceTask:
> reserve:
> >> pos=start requestedSize=3893000 size=341535057 numPendingRequests=0
> >> maxSize=417601952
> >> 2010-03-10 18:38:32,936 INFO org.apache.hadoop.mapred.ReduceTask:
> reserve:
> >> pos=end requestedSize=3893000 size=345428057 numPendingRequests=0
> >> maxSize=417601952
> >> ...
> >> 2010-03-10 18:38:35,950 INFO org.apache.hadoop.mapred.ReduceTask:
> reserve:
> >> pos=end requestedSize=635753 size=408181692 numPendingRequests=0
> >> maxSize=417601952
> >> 2010-03-10 18:38:36,603 INFO org.apache.hadoop.mapred.ReduceTask: Task
> >> attempt_201003101826_0001_r_000004_0: Failed fetch #1 from
> >> attempt_201003101826_0001_m_000875_0
> >>
> >> 2010-03-10 18:38:36,603 WARN org.apache.hadoop.mapred.ReduceTask:
> >> attempt_201003101826_0001_r_000004_0 adding host
> hd17.dfs.returnpath.net to penalty box, next contact in 4 seconds
> >> 2010-03-10 18:38:36,604 INFO org.apache.hadoop.mapred.ReduceTask:
> >> attempt_201003101826_0001_r_000004_0: Got 1 map-outputs from previous
> >> failures
> >> 2010-03-10 18:38:36,605 FATAL org.apache.hadoop.mapred.TaskRunner:
> >> attempt_201003101826_0001_r_000004_0 : Map output copy failure :
> >> java.lang.OutOfMemoryError: Java heap space
> >>         at
> >>
> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.shuffleInMemory(ReduceTask.java:1513)
> >>         at
> >>
> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1413)
> >>         at
> >>
> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1266)
> >>         at
> >>
> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1200)
> >>
> >> Looking at the call to unreserve() in ReduceTask, two were for
> IOException
> >> and the other was for Sanity check (line 1557). Meaning they wouldn't be
> >> called in normal execution path.
> >>
> >> I see one call in IFile.InMemoryReader close() method:
> >>       // Inform the RamManager
> >>       ramManager.unreserve(bufferSize);
> >>
> >> And InMemoryReader is used in createInMemorySegments():
> >>           Reader<K, V> reader =
> >>             new InMemoryReader<K, V>(ramManager, mo.mapAttemptId,
> >>                                      mo.data, 0, mo.data.length);
> >>
> >> But I don't see reader.close() in ReduceTask file.
> >>
> >
> >
> >> On Wed, Mar 10, 2010 at 3:34 PM, Chris Douglas <chrisdo@yahoo-inc.com
> >wrote:
> >>
> >>> I don't think this OOM is a framework bug per se, and given the
> >>> rewrite/refactoring of the shuffle in MAPREDUCE-318 (in 0.21), tuning
> the
> >>> 0.20 shuffle semantics is likely not worthwhile (though data informing
> >>> improvements to trunk would be excellent). Most likely (and
> tautologically),
> >>> ReduceTask simply requires more memory than is available and the job
> failure
> >>> can be avoided by either 0) increasing the heap size or 1) lowering
> >>> mapred.shuffle.input.buffer.percent. Most of the tasks we run have a
> heap of
> >>> 1GB. For a reduce fetching >200k map outputs, that's a reasonable, even
> >>> stingy amount of space. -C
> >>>
> >>>
> >>> On Mar 10, 2010, at 5:26 AM, Ted Yu wrote:
> >>>
> >>>  I verified that size and maxSize are long. This means MR-1182 didn't
> >>>> resolve
> >>>> Andy's issue.
> >>>>
> >>>> According to Andy:
> >>>> At the beginning of the job there are 209,754 pending map tasks and 32
> >>>> pending reduce tasks
> >>>>
> >>>> My guess is that GC wasn't reclaiming memory fast enough, leading to
> OOME
> >>>> because of large number of in-memory shuffle candidates.
> >>>>
> >>>> My suggestion for Andy would be to:
> >>>> 1. add -verbose:gc as JVM parameter
> >>>> 2. modify reserve() slightly to calculate the maximum outstanding
> >>>> numPendingRequests and print the maximum.
> >>>>
> >>>> Based on the output from above two items, we can discuss solution.
> >>>> My intuition is to place an upper bound on numPendingRequests beyond which
> >>>> canFitInMemory() returns false.
> >>>>
> >>>> My two cents.
> >>>>
> >>>> On Tue, Mar 9, 2010 at 11:51 PM, Christopher Douglas
> >>>> <ch...@yahoo-inc.com>wrote:
> >>>>
> >>>>  That section of code is unmodified in MR-1182. See the patches/svn
> log.
> >>>>> -C
> >>>>>
> >>>>> Sent from my iPhone
> >>>>>
> >>>>>
> >>>>> On Mar 9, 2010, at 7:44 PM, "Ted Yu" <yu...@gmail.com> wrote:
> >>>>>
> >>>>> I just downloaded hadoop-0.20.2 tar ball from cloudera mirror.
> >>>>>
> >>>>>> This is what I see in ReduceTask (line 999):
> >>>>>>   public synchronized boolean reserve(int requestedSize, InputStream
> >>>>>> in)
> >>>>>>
> >>>>>>   throws InterruptedException {
> >>>>>>     // Wait till the request can be fulfilled...
> >>>>>>     while ((size + requestedSize) > maxSize) {
> >>>>>>
> >>>>>> I don't see the fix from MR-1182.
> >>>>>>
> >>>>>> That's why I suggested to Andy that he manually apply MR-1182.
> >>>>>>
> >>>>>> Cheers
> >>>>>>
> >>>>>> On Tue, Mar 9, 2010 at 5:01 PM, Andy Sautins <
> >>>>>> andy.sautins@returnpath.net
> >>>>>>
> >>>>>>> wrote:
> >>>>>>>
> >>>>>>
> >>>>>>
> >>>>>>  Thanks Christopher.
> >>>>>>>
> >>>>>>> The heap size for reduce tasks is configured to be 640M (
> >>>>>>> mapred.child.java.opts set to -Xmx640m ).
> >>>>>>>
> >>>>>>> Andy
> >>>>>>>
> >>>>>>> -----Original Message-----
> >>>>>>> From: Christopher Douglas [mailto:chrisdo@yahoo-inc.com]
> >>>>>>> Sent: Tuesday, March 09, 2010 5:19 PM
> >>>>>>> To: common-user@hadoop.apache.org
> >>>>>>> Subject: Re: Shuffle In Memory OutOfMemoryError
> >>>>>>>
> >>>>>>> No, MR-1182 is included in 0.20.2
> >>>>>>>
> >>>>>>> What heap size have you set for your reduce tasks? -C
> >>>>>>>
> >>>>>>> Sent from my iPhone
> >>>>>>>
> >>>>>>> On Mar 9, 2010, at 2:34 PM, "Ted Yu" <yu...@gmail.com> wrote:
> >>>>>>>
> >>>>>>> Andy:
> >>>>>>>
> >>>>>>>> You need to manually apply the patch.
> >>>>>>>>
> >>>>>>>> Cheers
> >>>>>>>>
> >>>>>>>> On Tue, Mar 9, 2010 at 2:23 PM, Andy Sautins <
> >>>>>>>>
> >>>>>>>>  andy.sautins@returnpath.net
> >>>>>>>
> >>>>>>>  wrote:
> >>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>>  Thanks Ted.  My understanding is that MAPREDUCE-1182 is included
> >>>>>>>>> in the
> >>>>>>>>> 0.20.2 release.  We upgraded our cluster to 0.20.2 this weekend
> and
> >>>>>>>>> re-ran
> >>>>>>>>> the same job scenarios.  Running with
> mapred.reduce.parallel.copies
> >>>>>>>>> set to 1
> >>>>>>>>> and continue to have the same Java heap space error.
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> -----Original Message-----
> >>>>>>>>> From: Ted Yu [mailto:yuzhihong@gmail.com]
> >>>>>>>>> Sent: Tuesday, March 09, 2010 12:56 PM
> >>>>>>>>> To: common-user@hadoop.apache.org
> >>>>>>>>> Subject: Re: Shuffle In Memory OutOfMemoryError
> >>>>>>>>>
> >>>>>>>>> This issue has been resolved in
> >>>>>>>>> http://issues.apache.org/jira/browse/MAPREDUCE-1182
> >>>>>>>>>
> >>>>>>>>> Please apply the patch
> >>>>>>>>> M1182-1v20.patch<
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>
> http://issues.apache.org/jira/secure/attachment/12424116/M1182-1v20.patch
> >>>>>>>
> >>>>>>>
> >>>>>>>>
> >>>>>>>>>>  On Sun, Mar 7, 2010 at 3:57 PM, Andy Sautins <
> >>>>>>>>>
> >>>>>>>>>  andy.sautins@returnpath.net
> >>>>>>>>
> >>>>>>>
> >>>>>>>  wrote:
> >>>>>>>>
> >>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>  Thanks Ted.  Very helpful.  You are correct that I misunderstood
> >>>>>>>>>> the
> >>>>>>>>>>
> >>>>>>>>>>  code
> >>>>>>>>>
> >>>>>>>>>  at ReduceTask.java:1535.  I missed the fact that it's in a
> >>>>>>>>>> IOException
> >>>>>>>>>>
> >>>>>>>>>>  catch
> >>>>>>>>>
> >>>>>>>>>  block.  My mistake.  That's what I get for being in a rush.
> >>>>>>>>>>
> >>>>>>>>>> For what it's worth I did re-run the job with
> >>>>>>>>>> mapred.reduce.parallel.copies set with values from 5 all the way
> >>>>>>>>>> down to
> >>>>>>>>>>
> >>>>>>>>>>  1.
> >>>>>>>>>
> >>>>>>>>>  All failed with the same error:
> >>>>>>>>>>
> >>>>>>>>>> Error: java.lang.OutOfMemoryError: Java heap space
> >>>>>>>>>>   at
> >>>>>>>>>>
> >>>>>>>>>> org.apache.hadoop.mapred.ReduceTask$ReduceCopier
> >>>>>>>>>>
> >>>>>>>>> $MapOutputCopier.shuffleInMemory(ReduceTask.java:1508)
> >>>>>>>>>
> >>>>>>>>>    at
> >>>>>>>>>>
> >>>>>>>>>> org.apache.hadoop.mapred.ReduceTask$ReduceCopier
> >>>>>>>>>>
> >>>>>>>>> $MapOutputCopier.getMapOutput(ReduceTask.java:1408)
> >>>>>>>>>
> >>>>>>>>>    at
> >>>>>>>>>>
> >>>>>>>>>> org.apache.hadoop.mapred.ReduceTask$ReduceCopier
> >>>>>>>>>>
> >>>>>>>>> $MapOutputCopier.copyOutput(ReduceTask.java:1261)
> >>>>>>>>>
> >>>>>>>>>    at
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run
> >>>>>>>>>>
> >>>>>>>>> (ReduceTask.java:1195)
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> So from that it does seem like something else might be going on,
> >>>>>>>>>> yes?
> >>>>>>>>>>
> >>>>>>>>>>  I
> >>>>>>>>>
> >>>>>>>>>  need to do some more research.
> >>>>>>>>>>
> >>>>>>>>>> I appreciate your insights.
> >>>>>>>>>>
> >>>>>>>>>> Andy
> >>>>>>>>>>
> >>>>>>>>>> -----Original Message-----
> >>>>>>>>>> From: Ted Yu [mailto:yuzhihong@gmail.com]
> >>>>>>>>>> Sent: Sunday, March 07, 2010 3:38 PM
> >>>>>>>>>> To: common-user@hadoop.apache.org
> >>>>>>>>>> Subject: Re: Shuffle In Memory OutOfMemoryError
> >>>>>>>>>>
> >>>>>>>>>> My observation is based on this call chain:
> >>>>>>>>>> MapOutputCopier.run() calling copyOutput() calling
> getMapOutput()
> >>>>>>>>>> calling
> >>>>>>>>>> ramManager.canFitInMemory(decompressedLength)
> >>>>>>>>>>
> >>>>>>>>>> Basically ramManager.canFitInMemory() makes decision without
> >>>>>>>>>> considering
> >>>>>>>>>> the
> >>>>>>>>>> number of MapOutputCopiers that are running. Thus 1.25 * 0.7 of
> >>>>>>>>>> total
> >>>>>>>>>>
> >>>>>>>>>>  heap
> >>>>>>>>>
> >>>>>>>>>  may be used in shuffling if default parameters were used.
> >>>>>>>>>> Of course, you should check the value for
> >>>>>>>>>> mapred.reduce.parallel.copies
> >>>>>>>>>>
> >>>>>>>>>>  to
> >>>>>>>>>
> >>>>>>>>>  see if it is 5. If it is 4 or lower, my reasoning wouldn't
> apply.
> >>>>>>>>>>
> >>>>>>>>>> About ramManager.unreserve() call, ReduceTask.java from hadoop
> >>>>>>>>>> 0.20.2
> >>>>>>>>>>
> >>>>>>>>>>  only
> >>>>>>>>>
> >>>>>>>>>  has 2731 lines. So I have to guess the location of the code
> >>>>>>>>>> snippet you
> >>>>>>>>>> provided.
> >>>>>>>>>> I found this around line 1535:
> >>>>>>>>>>   } catch (IOException ioe) {
> >>>>>>>>>>     LOG.info("Failed to shuffle from " +
> >>>>>>>>>> mapOutputLoc.getTaskAttemptId(),
> >>>>>>>>>>              ioe);
> >>>>>>>>>>
> >>>>>>>>>>     // Inform the ram-manager
> >>>>>>>>>>     ramManager.closeInMemoryFile(mapOutputLength);
> >>>>>>>>>>     ramManager.unreserve(mapOutputLength);
> >>>>>>>>>>
> >>>>>>>>>>     // Discard the map-output
> >>>>>>>>>>     try {
> >>>>>>>>>>       mapOutput.discard();
> >>>>>>>>>>     } catch (IOException ignored) {
> >>>>>>>>>>       LOG.info("Failed to discard map-output from " +
> >>>>>>>>>>                mapOutputLoc.getTaskAttemptId(), ignored);
> >>>>>>>>>>     }
> >>>>>>>>>> Please confirm the line number.
> >>>>>>>>>>
> >>>>>>>>>> If we're looking at the same code, I am afraid I don't see how
> we
> >>>>>>>>>> can
> >>>>>>>>>> improve it. First, I assume IOException shouldn't happen that
> >>>>>>>>>> often.
> >>>>>>>>>> Second,
> >>>>>>>>>> mapOutput.discard() just sets:
> >>>>>>>>>>     data = null;
> >>>>>>>>>> for in memory case. Even if we call mapOutput.discard() before
> >>>>>>>>>> ramManager.unreserve(), we don't know when GC would kick in and
> >>>>>>>>>> make more
> >>>>>>>>>> memory available.
> >>>>>>>>>> Of course, given the large number of map outputs in your system,
> it
> >>>>>>>>>>
> >>>>>>>>>>  became
> >>>>>>>>>
> >>>>>>>>>  more likely that the root cause from my reasoning made OOME
> happen
> >>>>>>>>>>
> >>>>>>>>>>  sooner.
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>> Thanks
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>  On Sun, Mar 7, 2010 at 1:03 PM, Andy Sautins <
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>  andy.sautins@returnpath.net
> >>>>>>>>>
> >>>>>>>>>  wrote:
> >>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>  Ted,
> >>>>>>>>>>>
> >>>>>>>>>>> I'm trying to follow the logic in your mail and I'm not sure
> I'm
> >>>>>>>>>>> following.  If you would mind helping me understand I would
> >>>>>>>>>>> appreciate
> >>>>>>>>>>>
> >>>>>>>>>>>  it.
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>> Looking at the code maxSingleShuffleLimit is only used in
> >>>>>>>>>>> determining
> >>>>>>>>>>>
> >>>>>>>>>>>  if
> >>>>>>>>>>
> >>>>>>>>>>  the copy _can_ fit into memory:
> >>>>>>>>>>>
> >>>>>>>>>>> boolean canFitInMemory(long requestedSize) {
> >>>>>>>>>>>   return (requestedSize < Integer.MAX_VALUE &&
> >>>>>>>>>>>           requestedSize < maxSingleShuffleLimit);
> >>>>>>>>>>>  }
> >>>>>>>>>>>
> >>>>>>>>>>> It also looks like the RamManager.reserve should wait until
> >>>>>>>>>>> memory
> >>>>>>>>>>>
> >>>>>>>>>>>  is
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>  available so it should hit a memory limit for that reason.
> >>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> What does seem a little strange to me is the following (
> >>>>>>>>>>>
> >>>>>>>>>>>  ReduceTask.java
> >>>>>>>>>>
> >>>>>>>>>>  starting at 2730 ):
> >>>>>>>>>>>
> >>>>>>>>>>>     // Inform the ram-manager
> >>>>>>>>>>>     ramManager.closeInMemoryFile(mapOutputLength);
> >>>>>>>>>>>     ramManager.unreserve(mapOutputLength);
> >>>>>>>>>>>
> >>>>>>>>>>>     // Discard the map-output
> >>>>>>>>>>>     try {
> >>>>>>>>>>>       mapOutput.discard();
> >>>>>>>>>>>     } catch (IOException ignored) {
> >>>>>>>>>>>       LOG.info("Failed to discard map-output from " +
> >>>>>>>>>>>                mapOutputLoc.getTaskAttemptId(), ignored);
> >>>>>>>>>>>     }
> >>>>>>>>>>>     mapOutput = null;
> >>>>>>>>>>>
> >>>>>>>>>>> So to me that looks like the ramManager unreserves the memory
> >>>>>>>>>>> before
> >>>>>>>>>>>
> >>>>>>>>>>>  the
> >>>>>>>>>>
> >>>>>>>>>>  mapOutput is discarded.  Shouldn't the mapOutput be discarded
> >>>>>>>>>>> _before_
> >>>>>>>>>>>
> >>>>>>>>>>>  the
> >>>>>>>>>>
> >>>>>>>>>>  ramManager unreserves the memory?  If the memory is unreserved
> >>>>>>>>>>> before
> >>>>>>>>>>>
> >>>>>>>>>>>  the
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>  actual underlying data references are removed then it seems like
> >>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>  another
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>  thread can try to allocate memory ( ReduceTask.java:2730 )
> before
> >>>>>>>>>>
> >>>>>>>>>>> the
> >>>>>>>>>>> previous memory is disposed ( mapOutput.discard() ).
> >>>>>>>>>>>
> >>>>>>>>>>> Not sure that makes sense.  One thing to note is that the
> >>>>>>>>>>> particular
> >>>>>>>>>>>
> >>>>>>>>>>>  job
> >>>>>>>>>>
> >>>>>>>>>>  that is failing does have a good number ( 200k+ ) of map
> >>>>>>>>>>> outputs.  The
> >>>>>>>>>>>
> >>>>>>>>>>>  large
> >>>>>>>>>>
> >>>>>>>>>>  number of small map outputs may be why we are triggering a
> >>>>>>>>>>> problem.
> >>>>>>>>>>>
> >>>>>>>>>>> Thanks again for your thoughts.
> >>>>>>>>>>>
> >>>>>>>>>>> Andy
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> -----Original Message-----
> >>>>>>>>>>> From: Jacob R Rideout [mailto:apache@jacobrideout.net]
> >>>>>>>>>>> Sent: Sunday, March 07, 2010 1:21 PM
> >>>>>>>>>>> To: common-user@hadoop.apache.org
> >>>>>>>>>>> Cc: Andy Sautins; Ted Yu
> >>>>>>>>>>> Subject: Re: Shuffle In Memory OutOfMemoryError
> >>>>>>>>>>>
> >>>>>>>>>>> Ted,
> >>>>>>>>>>>
> >>>>>>>>>>> Thank you. I filled MAPREDUCE-1571 to cover this issue. I might
> >>>>>>>>>>> have
> >>>>>>>>>>> some time to write a patch later this week.
> >>>>>>>>>>>
> >>>>>>>>>>> Jacob Rideout
> >>>>>>>>>>>
> >>>>>>>>>>> On Sat, Mar 6, 2010 at 11:37 PM, Ted Yu <yu...@gmail.com>
> >>>>>>>>>>> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>  I think there is mismatch (in ReduceTask.java) between:
> >>>>>>>>>>>>  this.numCopiers =
> conf.getInt("mapred.reduce.parallel.copies",
> >>>>>>>>>>>>
> >>>>>>>>>>>>  5);
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>  and:
> >>>>>>>>>>
> >>>>>>>>>>>   maxSingleShuffleLimit = (long)(maxSize *
> >>>>>>>>>>>> MAX_SINGLE_SHUFFLE_SEGMENT_FRACTION);
> >>>>>>>>>>>> where MAX_SINGLE_SHUFFLE_SEGMENT_FRACTION is 0.25f
> >>>>>>>>>>>>
> >>>>>>>>>>>> because
> >>>>>>>>>>>>  copiers = new ArrayList<MapOutputCopier>(numCopiers);
> >>>>>>>>>>>> so the total memory allocated for in-mem shuffle is 1.25 *
> >>>>>>>>>>>> maxSize
> >>>>>>>>>>>>
> >>>>>>>>>>>> A JIRA should be filed to correlate the constant 5 above and
> >>>>>>>>>>>> MAX_SINGLE_SHUFFLE_SEGMENT_FRACTION.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Cheers
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Sat, Mar 6, 2010 at 8:31 AM, Jacob R Rideout <
> >>>>>>>>>>>>
> >>>>>>>>>>>>  apache@jacobrideout.net
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>  wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> Hi all,
> >>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> We are seeing the following error in our reducers of a
> >>>>>>>>>>>>> particular
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>  job:
> >>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>>  Error: java.lang.OutOfMemoryError: Java heap space
> >>>>>>>>>>>>>   at
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>  org.apache.hadoop.mapred.ReduceTask$ReduceCopier
> >>>>>>>>>>
> >>>>>>>>> $MapOutputCopier.shuffleInMemory(ReduceTask.java:1508)
> >>>>>>>>>
> >>>>>>>>>    at
> >>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>  org.apache.hadoop.mapred.ReduceTask$ReduceCopier
> >>>>>>>>>>
> >>>>>>>>> $MapOutputCopier.getMapOutput(ReduceTask.java:1408)
> >>>>>>>>>
> >>>>>>>>>    at
> >>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>  org.apache.hadoop.mapred.ReduceTask$ReduceCopier
> >>>>>>>>>>
> >>>>>>>>> $MapOutputCopier.copyOutput(ReduceTask.java:1261)
> >>>>>>>>>
> >>>>>>>>>    at
> >>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>
> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run
> >>>>>>>>>>
> >>>>>>>>> (ReduceTask.java:1195)
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>>>> After enough reducers fail the entire job fails. This error
> >>>>>>>>>>>>> occurs
> >>>>>>>>>>>>> regardless of whether mapred.compress.map.output is true. We
> >>>>>>>>>>>>> were
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>  able
> >>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>  to avoid the issue by reducing
> >>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>>>  mapred.job.shuffle.input.buffer.percent
> >>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>  to 20%. Shouldn't the framework via
> >>>>>>>>>>
> >>>>>>>>>>>  ShuffleRamManager.canFitInMemory
> >>>>>>>>>>>>> and.ShuffleRamManager.reserve correctly detect the the memory
> >>>>>>>>>>>>> available for allocation? I would think that with poor
> >>>>>>>>>>>>> configuration
> >>>>>>>>>>>>> settings (and default settings in particular) the job may not
> >>>>>>>>>>>>> be as
> >>>>>>>>>>>>> efficient, but wouldn't die.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Here is some more context in the logs, I have attached the
> full
> >>>>>>>>>>>>> reducer log here: http://gist.github.com/323746
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> 2010-03-06 07:54:49,621 INFO
> >>>>>>>>>>>>> org.apache.hadoop.mapred.ReduceTask:
> >>>>>>>>>>>>> Shuffling 4191933 bytes (435311 raw bytes) into RAM from
> >>>>>>>>>>>>> attempt_201003060739_0002_m_000061_0
> >>>>>>>>>>>>> 2010-03-06 07:54:50,222 INFO
> >>>>>>>>>>>>> org.apache.hadoop.mapred.ReduceTask:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>  Task
> >>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>  attempt_201003060739_0002_r_000000_0: Failed fetch #1 from
> >>>>>>>>>>
> >>>>>>>>>>>  attempt_201003060739_0002_m_000202_0
> >>>>>>>>>>>>> 2010-03-06 07:54:50,223 WARN
> >>>>>>>>>>>>> org.apache.hadoop.mapred.ReduceTask:
> >>>>>>>>>>>>> attempt_201003060739_0002_r_000000_0 adding host
> >>>>>>>>>>>>> hd37.dfs.returnpath.net to penalty box, next contact in 4
> >>>>>>>>>>>>> seconds
> >>>>>>>>>>>>> 2010-03-06 07:54:50,223 INFO
> >>>>>>>>>>>>> org.apache.hadoop.mapred.ReduceTask:
> >>>>>>>>>>>>> attempt_201003060739_0002_r_000000_0: Got 1 map-outputs from
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>  previous
> >>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>  failures
> >>>>>>>>>>
> >>>>>>>>>>>  2010-03-06 07:54:50,223 FATAL
> >>>>>>>>>>>>> org.apache.hadoop.mapred.TaskRunner:
> >>>>>>>>>>>>> attempt_201003060739_0002_r_000000_0 : Map output copy
> failure :
> >>>>>>>>>>>>> java.lang.OutOfMemoryError: Java heap space
> >>>>>>>>>>>>>   at
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>  org.apache.hadoop.mapred.ReduceTask$ReduceCopier
> >>>>>>>>>>
> >>>>>>>>> $MapOutputCopier.shuffleInMemory(ReduceTask.java:1508)
> >>>>>>>>>
> >>>>>>>>>    at
> >>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>  org.apache.hadoop.mapred.ReduceTask$ReduceCopier
> >>>>>>>>>>
> >>>>>>>>> $MapOutputCopier.getMapOutput(ReduceTask.java:1408)
> >>>>>>>>>
> >>>>>>>>>    at
> >>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>  org.apache.hadoop.mapred.ReduceTask$ReduceCopier
> >>>>>>>>>>
> >>>>>>>>> $MapOutputCopier.copyOutput(ReduceTask.java:1261)
> >>>>>>>>>
> >>>>>>>>>    at
> >>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>
> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run
> >>>>>>>>>>
> >>>>>>>>> (ReduceTask.java:1195)
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>>>> We tried this both in 0.20.1 and 0.20.2. We had hoped
> >>>>>>>>>>>>> MAPREDUCE-1182
> >>>>>>>>>>>>> would address the issue in 0.20.2, but it did not. Does
> anyone
> >>>>>>>>>>>>> have
> >>>>>>>>>>>>> any comments or suggestions? Is this a bug I should file a
> JIRA
> >>>>>>>>>>>>> for?
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Jacob Rideout
> >>>>>>>>>>>>> Return Path
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>
> >>>
> >>
> >
>

Re: Shuffle In Memory OutOfMemoryError

Posted by Bo Shi <bs...@visiblemeasures.com>.
Hey Ted, any further insights on this?  We're encountering a similar
issue (on CD2).  I'll be applying MAPREDUCE-1182 to see if that
resolves our case but it sounds like that JIRA didn't completely
eliminate the problem for some folks.
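
For anyone else landing on this thread: if, as the earlier messages suggest, the
point of MAPREDUCE-1182 is to keep the shuffle's size/maxSize accounting in long
rather than int, then the failure mode it guards against is integer overflow in
the reserve() check. A toy, self-contained illustration (the numbers are made up
for the demo, not taken from this job):

    // Toy demo, not Hadoop code: with int fields, size + requestedSize can wrap
    // negative, so a guard like "(size + requestedSize) > maxSize" lets the
    // reservation through even though the buffer is already full.
    public class OverflowDemo {
      public static void main(String[] args) {
        int size = 2000000000;          // bytes already reserved
        int requestedSize = 400000000;  // next map output
        int maxSize = 2100000000;       // shuffle buffer limit
        System.out.println("int sum      = " + (size + requestedSize));              // negative
        System.out.println("guard passes = " + ((size + requestedSize) <= maxSize)); // true
        System.out.println("long sum     = " + ((long) size + requestedSize));       // correct
      }
    }

With the ~640 MB reduce heaps mentioned in this thread, maxSize is nowhere near
Integer.MAX_VALUE, which is consistent with the report above that applying the
JIRA did not eliminate the problem.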


Re: Shuffle In Memory OutOfMemoryError

Posted by Ted Yu <yu...@gmail.com>.
I pressed the send key a bit early.

I will have to dig a bit deeper.
Hopefully someone can find a reader.close() call, after which I will look for
another possible root cause :-)



Re: Shuffle In Memory OutOfMemoryError

Posted by Ted Yu <yu...@gmail.com>.
Thanks to Andy for the log he provided.

You can see from the log below that size increased steadily from 341535057
to 408181692, approaching maxSize. Then OOME:

2010-03-10 18:38:32,936 INFO org.apache.hadoop.mapred.ReduceTask: reserve:
pos=start requestedSize=3893000 size=341535057 numPendingRequests=0
maxSize=417601952
2010-03-10 18:38:32,936 INFO org.apache.hadoop.mapred.ReduceTask: reserve:
pos=end requestedSize=3893000 size=345428057 numPendingRequests=0
maxSize=417601952
...
2010-03-10 18:38:35,950 INFO org.apache.hadoop.mapred.ReduceTask: reserve:
pos=end requestedSize=635753 size=408181692 numPendingRequests=0
maxSize=417601952
2010-03-10 18:38:36,603 INFO org.apache.hadoop.mapred.ReduceTask: Task
attempt_201003101826_0001_r_000004_0: Failed fetch #1 from
attempt_201003101826_0001_m_000875_0

2010-03-10 18:38:36,603 WARN org.apache.hadoop.mapred.ReduceTask:
attempt_201003101826_0001_r_000004_0 adding host hd17.dfs.returnpath.net to
penalty box, next contact in 4 seconds
2010-03-10 18:38:36,604 INFO org.apache.hadoop.mapred.ReduceTask:
attempt_201003101826_0001_r_000004_0: Got 1 map-outputs from previous
failures
2010-03-10 18:38:36,605 FATAL org.apache.hadoop.mapred.TaskRunner:
attempt_201003101826_0001_r_000004_0 : Map output copy failure :
java.lang.OutOfMemoryError: Java heap space
        at
org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.shuffleInMemory(ReduceTask.java:1513)
        at
org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1413)
        at
org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1266)
        at
org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1200)

Looking at the calls to unreserve() in ReduceTask, two are in IOException
handlers and the other is for a sanity check (line 1557), meaning they
wouldn't be called on the normal execution path.

I see one call in IFile.InMemoryReader:
      // Inform the RamManager
      ramManager.unreserve(bufferSize);

And InMemoryReader is used in
          Reader<K, V> reader =
            new InMemoryReader<K, V>(ramManager, mo.mapAttemptId,
                                     mo.data, 0, mo.data.length);
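
To make the bookkeeping concrete, here is a rough sketch of the
reserve/unreserve life cycle described above (illustrative only -- the class
and fields are simplified stand-ins, not the actual 0.20 ShuffleRamManager):
reserve() only ever grows size, and the matching unreserve() happens later,
when the in-memory segment is finally read by the merge (via InMemoryReader)
or discarded on an error path. If copiers fetch small map outputs faster than
the merge drains them, size keeps climbing toward maxSize:

  // Simplified illustration, not the Hadoop 0.20 source.
  class SketchRamManager {
    private long size = 0;                  // bytes currently reserved
    private final long maxSize;             // ~ heap * mapred.job.shuffle.input.buffer.percent
    private final long maxSingleShuffleLimit;

    SketchRamManager(long maxSize) {
      this.maxSize = maxSize;
      this.maxSingleShuffleLimit = (long) (maxSize * 0.25f);
    }

    boolean canFitInMemory(long requestedSize) {
      // Per-request decision; it does not look at how much is already reserved.
      return requestedSize < Integer.MAX_VALUE
          && requestedSize < maxSingleShuffleLimit;
    }

    synchronized void reserve(long requestedSize) throws InterruptedException {
      while (size + requestedSize > maxSize) {
        wait();                             // copier blocks until space frees up
      }
      size += requestedSize;                // held until the segment is merged or discarded
    }

    synchronized void unreserve(long requestedSize) {
      size -= requestedSize;
      notifyAll();                          // wake any copiers waiting in reserve()
    }
  }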


On Wed, Mar 10, 2010 at 3:34 PM, Chris Douglas <ch...@yahoo-inc.com>wrote:

> I don't think this OOM is a framework bug per se, and given the
> rewrite/refactoring of the shuffle in MAPREDUCE-318 (in 0.21), tuning the
> 0.20 shuffle semantics is likely not worthwhile (though data informing
> improvements to trunk would be excellent). Most likely (and tautologically),
> ReduceTask simply requires more memory than is available and the job failure
> can be avoided by either 0) increasing the heap size or 1) lowering
> mapred.job.shuffle.input.buffer.percent. Most of the tasks we run have a heap of
> 1GB. For a reduce fetching >200k map outputs, that's a reasonable, even
> stingy amount of space. -C
>
>
> On Mar 10, 2010, at 5:26 AM, Ted Yu wrote:
>
>  I verified that size and maxSize are long. This means MR-1182 didn't
>> resolve
>> Andy's issue.
>>
>> According to Andy:
>> At the beginning of the job there are 209,754 pending map tasks and 32
>> pending reduce tasks
>>
>> My guess is that GC wasn't reclaiming memory fast enough, leading to OOME
>> because of large number of in-memory shuffle candidates.
>>
>> My suggestion for Andy would be to:
>> 1. add -verbose:gc as JVM parameter
>> 2. modify reserve() slightly to calculate the maximum outstanding
>> numPendingRequests and print the maximum.
>>
>> Based on the output from above two items, we can discuss solution.
>> My intuition is to place upperbound on numPendingRequests beyond which
>> canFitInMemory() returns false.
>> My two cents.
>>
>> On Tue, Mar 9, 2010 at 11:51 PM, Christopher Douglas
>> <ch...@yahoo-inc.com>wrote:
>>
>>  That section of code is unmodified in MR-1182. See the patches/svn log.
>>> -C
>>>
>>> Sent from my iPhone
>>>
>>>
>>> On Mar 9, 2010, at 7:44 PM, "Ted Yu" <yu...@gmail.com> wrote:
>>>
>>> I just downloaded hadoop-0.20.2 tar ball from cloudera mirror.
>>>
>>>> This is what I see in ReduceTask (line 999):
>>>>   public synchronized boolean reserve(int requestedSize, InputStream in)
>>>>
>>>>   throws InterruptedException {
>>>>     // Wait till the request can be fulfilled...
>>>>     while ((size + requestedSize) > maxSize) {
>>>>
>>>> I don't see the fix from MR-1182.
>>>>
>>>> That's why I suggested to Andy that he manually apply MR-1182.
>>>>
>>>> Cheers
>>>>
>>>> On Tue, Mar 9, 2010 at 5:01 PM, Andy Sautins <
>>>> andy.sautins@returnpath.net
>>>>
>>>>> wrote:
>>>>>
>>>>
>>>>
>>>>  Thanks Christopher.
>>>>>
>>>>> The heap size for reduce tasks is configured to be 640M (
>>>>> mapred.child.java.opts set to -Xmx640m ).
>>>>>
>>>>> Andy
>>>>>
>>>>> -----Original Message-----
>>>>> From: Christopher Douglas [mailto:chrisdo@yahoo-inc.com]
>>>>> Sent: Tuesday, March 09, 2010 5:19 PM
>>>>> To: common-user@hadoop.apache.org
>>>>> Subject: Re: Shuffle In Memory OutOfMemoryError
>>>>>
>>>>> No, MR-1182 is included in 0.20.2
>>>>>
>>>>> What heap size have you set for your reduce tasks? -C
>>>>>
>>>>> Sent from my iPhone
>>>>>
>>>>> On Mar 9, 2010, at 2:34 PM, "Ted Yu" <yu...@gmail.com> wrote:
>>>>>
>>>>> Andy:
>>>>>
>>>>>> You need to manually apply the patch.
>>>>>>
>>>>>> Cheers
>>>>>>
>>>>>> On Tue, Mar 9, 2010 at 2:23 PM, Andy Sautins <
>>>>>>
>>>>>>  andy.sautins@returnpath.net
>>>>>
>>>>>  wrote:
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>  Thanks Ted.  My understanding is that MAPREDUCE-1182 is included
>>>>>>> in the
>>>>>>> 0.20.2 release.  We upgraded our cluster to 0.20.2 this weekend and
>>>>>>> re-ran
>>>>>>> the same job scenarios.  Running with mapred.reduce.parallel.copies
>>>>>>> set to 1
>>>>>>> and continue to have the same Java heap space error.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: Ted Yu [mailto:yuzhihong@gmail.com]
>>>>>>> Sent: Tuesday, March 09, 2010 12:56 PM
>>>>>>> To: common-user@hadoop.apache.org
>>>>>>> Subject: Re: Shuffle In Memory OutOfMemoryError
>>>>>>>
>>>>>>> This issue has been resolved in
>>>>>>> http://issues.apache.org/jira/browse/MAPREDUCE-1182
>>>>>>>
>>>>>>> Please apply the patch
>>>>>>> M1182-1v20.patch<
>>>>>>>
>>>>>>>
>>>>>>>
>>>>> http://issues.apache.org/jira/secure/attachment/12424116/M1182-1v20.patch
>>>>>
>>>>>
>>>>>>
>>>>>>>>  On Sun, Mar 7, 2010 at 3:57 PM, Andy Sautins <
>>>>>>>
>>>>>>>  andy.sautins@returnpath.net
>>>>>>
>>>>>
>>>>>  wrote:
>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>  Thanks Ted.  Very helpful.  You are correct that I misunderstood the
>>>>>>>>
>>>>>>>>  code
>>>>>>>
>>>>>>>  at ReduceTask.java:1535.  I missed the fact that it's in a
>>>>>>>> IOException
>>>>>>>>
>>>>>>>>  catch
>>>>>>>
>>>>>>>  block.  My mistake.  That's what I get for being in a rush.
>>>>>>>>
>>>>>>>> For what it's worth I did re-run the job with
>>>>>>>> mapred.reduce.parallel.copies set with values from 5 all the way
>>>>>>>> down to
>>>>>>>>
>>>>>>>>  1.
>>>>>>>
>>>>>>>  All failed with the same error:
>>>>>>>>
>>>>>>>> Error: java.lang.OutOfMemoryError: Java heap space
>>>>>>>>   at
>>>>>>>>
>>>>>>>> org.apache.hadoop.mapred.ReduceTask$ReduceCopier
>>>>>>>>
>>>>>>> $MapOutputCopier.shuffleInMemory(ReduceTask.java:1508)
>>>>>>>
>>>>>>>    at
>>>>>>>>
>>>>>>>> org.apache.hadoop.mapred.ReduceTask$ReduceCopier
>>>>>>>>
>>>>>>> $MapOutputCopier.getMapOutput(ReduceTask.java:1408)
>>>>>>>
>>>>>>>    at
>>>>>>>>
>>>>>>>> org.apache.hadoop.mapred.ReduceTask$ReduceCopier
>>>>>>>>
>>>>>>> $MapOutputCopier.copyOutput(ReduceTask.java:1261)
>>>>>>>
>>>>>>>    at
>>>>>>>>
>>>>>>>> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run
>>>>>>>>
>>>>>>> (ReduceTask.java:1195)
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> So from that it does seem like something else might be going on,
>>>>>>>> yes?
>>>>>>>>
>>>>>>>>  I
>>>>>>>
>>>>>>>  need to do some more research.
>>>>>>>>
>>>>>>>> I appreciate your insights.
>>>>>>>>
>>>>>>>> Andy
>>>>>>>>
>>>>>>>> -----Original Message-----
>>>>>>>> From: Ted Yu [mailto:yuzhihong@gmail.com]
>>>>>>>> Sent: Sunday, March 07, 2010 3:38 PM
>>>>>>>> To: common-user@hadoop.apache.org
>>>>>>>> Subject: Re: Shuffle In Memory OutOfMemoryError
>>>>>>>>
>>>>>>>> My observation is based on this call chain:
>>>>>>>> MapOutputCopier.run() calling copyOutput() calling getMapOutput()
>>>>>>>> calling
>>>>>>>> ramManager.canFitInMemory(decompressedLength)
>>>>>>>>
>>>>>>>> Basically ramManager.canFitInMemory() makes decision without
>>>>>>>> considering
>>>>>>>> the
>>>>>>>> number of MapOutputCopiers that are running. Thus 1.25 * 0.7 of
>>>>>>>> total
>>>>>>>>
>>>>>>>>  heap
>>>>>>>
>>>>>>>  may be used in shuffling if default parameters were used.
>>>>>>>> Of course, you should check the value for
>>>>>>>> mapred.reduce.parallel.copies
>>>>>>>>
>>>>>>>>  to
>>>>>>>
>>>>>>>  see if it is 5. If it is 4 or lower, my reasoning wouldn't apply.
>>>>>>>>
>>>>>>>> About ramManager.unreserve() call, ReduceTask.java from hadoop
>>>>>>>> 0.20.2
>>>>>>>>
>>>>>>>>  only
>>>>>>>
>>>>>>>  has 2731 lines. So I have to guess the location of the code
>>>>>>>> snippet you
>>>>>>>> provided.
>>>>>>>> I found this around line 1535:
>>>>>>>>   } catch (IOException ioe) {
>>>>>>>>     LOG.info("Failed to shuffle from " +
>>>>>>>> mapOutputLoc.getTaskAttemptId(),
>>>>>>>>              ioe);
>>>>>>>>
>>>>>>>>     // Inform the ram-manager
>>>>>>>>     ramManager.closeInMemoryFile(mapOutputLength);
>>>>>>>>     ramManager.unreserve(mapOutputLength);
>>>>>>>>
>>>>>>>>     // Discard the map-output
>>>>>>>>     try {
>>>>>>>>       mapOutput.discard();
>>>>>>>>     } catch (IOException ignored) {
>>>>>>>>       LOG.info("Failed to discard map-output from " +
>>>>>>>>                mapOutputLoc.getTaskAttemptId(), ignored);
>>>>>>>>     }
>>>>>>>> Please confirm the line number.
>>>>>>>>
>>>>>>>> If we're looking at the same code, I am afraid I don't see how we
>>>>>>>> can
>>>>>>>> improve it. First, I assume IOException shouldn't happen that often.
>>>>>>>> Second,
>>>>>>>> mapOutput.discard() just sets:
>>>>>>>>     data = null;
>>>>>>>> for in memory case. Even if we call mapOutput.discard() before
>>>>>>>> ramManager.unreserve(), we don't know when GC would kick in and
>>>>>>>> make more
>>>>>>>> memory available.
>>>>>>>> Of course, given the large number of map outputs in your system, it
>>>>>>>>
>>>>>>>>  became
>>>>>>>
>>>>>>>  more likely that the root cause from my reasoning made OOME happen
>>>>>>>>
>>>>>>>>  sooner.
>>>>>>>
>>>>>>>
>>>>>>>> Thanks
>>>>>>>>
>>>>>>>>
>>>>>>>>  On Sun, Mar 7, 2010 at 1:03 PM, Andy Sautins <
>>>>>>>>>
>>>>>>>>
>>>>>>>>  andy.sautins@returnpath.net
>>>>>>>
>>>>>>>  wrote:
>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>  Ted,
>>>>>>>>>
>>>>>>>>> I'm trying to follow the logic in your mail and I'm not sure I'm
>>>>>>>>> following.  If you would mind helping me understand I would
>>>>>>>>> appreciate
>>>>>>>>>
>>>>>>>>>  it.
>>>>>>>>
>>>>>>>>
>>>>>>>>> Looking at the code maxSingleShuffleLimit is only used in
>>>>>>>>> determining
>>>>>>>>>
>>>>>>>>>  if
>>>>>>>>
>>>>>>>>  the copy _can_ fit into memory:
>>>>>>>>>
>>>>>>>>> boolean canFitInMemory(long requestedSize) {
>>>>>>>>>   return (requestedSize < Integer.MAX_VALUE &&
>>>>>>>>>           requestedSize < maxSingleShuffleLimit);
>>>>>>>>>  }
>>>>>>>>>
>>>>>>>>> It also looks like the RamManager.reserve should wait until
>>>>>>>>> memory
>>>>>>>>>
>>>>>>>>>  is
>>>>>>>>
>>>>>>>
>>>>>>>  available so it should hit a memory limit for that reason.
>>>>>>>>
>>>>>>>>>
>>>>>>>>> What does seem a little strange to me is the following (
>>>>>>>>>
>>>>>>>>>  ReduceTask.java
>>>>>>>>
>>>>>>>>  starting at 2730 ):
>>>>>>>>>
>>>>>>>>>     // Inform the ram-manager
>>>>>>>>>     ramManager.closeInMemoryFile(mapOutputLength);
>>>>>>>>>     ramManager.unreserve(mapOutputLength);
>>>>>>>>>
>>>>>>>>>     // Discard the map-output
>>>>>>>>>     try {
>>>>>>>>>       mapOutput.discard();
>>>>>>>>>     } catch (IOException ignored) {
>>>>>>>>>       LOG.info("Failed to discard map-output from " +
>>>>>>>>>                mapOutputLoc.getTaskAttemptId(), ignored);
>>>>>>>>>     }
>>>>>>>>>     mapOutput = null;
>>>>>>>>>
>>>>>>>>> So to me that looks like the ramManager unreserves the memory
>>>>>>>>> before
>>>>>>>>>
>>>>>>>>>  the
>>>>>>>>
>>>>>>>>  mapOutput is discarded.  Shouldn't the mapOutput be discarded
>>>>>>>>> _before_
>>>>>>>>>
>>>>>>>>>  the
>>>>>>>>
>>>>>>>>  ramManager unreserves the memory?  If the memory is unreserved
>>>>>>>>> before
>>>>>>>>>
>>>>>>>>>  the
>>>>>>>>
>>>>>>>
>>>>>>>  actual underlying data references are removed then it seems like
>>>>>>>>
>>>>>>>>>
>>>>>>>>>  another
>>>>>>>>
>>>>>>>
>>>>>>>  thread can try to allocate memory ( ReduceTask.java:2730 ) before
>>>>>>>>
>>>>>>>>> the
>>>>>>>>> previous memory is disposed ( mapOutput.discard() ).
>>>>>>>>>
>>>>>>>>> Not sure that makes sense.  One thing to note is that the
>>>>>>>>> particular
>>>>>>>>>
>>>>>>>>>  job
>>>>>>>>
>>>>>>>>  that is failing does have a good number ( 200k+ ) of map
>>>>>>>>> outputs.  The
>>>>>>>>>
>>>>>>>>>  large
>>>>>>>>
>>>>>>>>  number of small map outputs may be why we are triggering a problem.
>>>>>>>>>
>>>>>>>>> Thanks again for your thoughts.
>>>>>>>>>
>>>>>>>>> Andy
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> -----Original Message-----
>>>>>>>>> From: Jacob R Rideout [mailto:apache@jacobrideout.net]
>>>>>>>>> Sent: Sunday, March 07, 2010 1:21 PM
>>>>>>>>> To: common-user@hadoop.apache.org
>>>>>>>>> Cc: Andy Sautins; Ted Yu
>>>>>>>>> Subject: Re: Shuffle In Memory OutOfMemoryError
>>>>>>>>>
>>>>>>>>> Ted,
>>>>>>>>>
>>>>>>>>> Thank you. I filled MAPREDUCE-1571 to cover this issue. I might
>>>>>>>>> have
>>>>>>>>> some time to write a patch later this week.
>>>>>>>>>
>>>>>>>>> Jacob Rideout
>>>>>>>>>
>>>>>>>>> On Sat, Mar 6, 2010 at 11:37 PM, Ted Yu <yu...@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>  I think there is mismatch (in ReduceTask.java) between:
>>>>>>>>>>  this.numCopiers = conf.getInt("mapred.reduce.parallel.copies",
>>>>>>>>>>
>>>>>>>>>>  5);
>>>>>>>>>
>>>>>>>>
>>>>>>>  and:
>>>>>>>>
>>>>>>>>>   maxSingleShuffleLimit = (long)(maxSize *
>>>>>>>>>> MAX_SINGLE_SHUFFLE_SEGMENT_FRACTION);
>>>>>>>>>> where MAX_SINGLE_SHUFFLE_SEGMENT_FRACTION is 0.25f
>>>>>>>>>>
>>>>>>>>>> because
>>>>>>>>>>  copiers = new ArrayList<MapOutputCopier>(numCopiers);
>>>>>>>>>> so the total memory allocated for in-mem shuffle is 1.25 * maxSize
>>>>>>>>>>
>>>>>>>>>> A JIRA should be filed to correlate the constant 5 above and
>>>>>>>>>> MAX_SINGLE_SHUFFLE_SEGMENT_FRACTION.
>>>>>>>>>>
>>>>>>>>>> Cheers
>>>>>>>>>>
>>>>>>>>>> On Sat, Mar 6, 2010 at 8:31 AM, Jacob R Rideout <
>>>>>>>>>>
>>>>>>>>>>  apache@jacobrideout.net
>>>>>>>>>
>>>>>>>>
>>>>>>>>  wrote:
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Hi all,
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> We are seeing the following error in our reducers of a particular
>>>>>>>>>>>
>>>>>>>>>>>  job:
>>>>>>>>>>
>>>>>>>>>
>>>>>>>
>>>>>>>>  Error: java.lang.OutOfMemoryError: Java heap space
>>>>>>>>>>>   at
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>  org.apache.hadoop.mapred.ReduceTask$ReduceCopier
>>>>>>>>
>>>>>>> $MapOutputCopier.shuffleInMemory(ReduceTask.java:1508)
>>>>>>>
>>>>>>>    at
>>>>>>>>
>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>  org.apache.hadoop.mapred.ReduceTask$ReduceCopier
>>>>>>>>
>>>>>>> $MapOutputCopier.getMapOutput(ReduceTask.java:1408)
>>>>>>>
>>>>>>>    at
>>>>>>>>
>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>  org.apache.hadoop.mapred.ReduceTask$ReduceCopier
>>>>>>>>
>>>>>>> $MapOutputCopier.copyOutput(ReduceTask.java:1261)
>>>>>>>
>>>>>>>    at
>>>>>>>>
>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run
>>>>>>>>
>>>>>>> (ReduceTask.java:1195)
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>>>>> After enough reducers fail the entire job fails. This error
>>>>>>>>>>> occurs
>>>>>>>>>>> regardless of whether mapred.compress.map.output is true. We were
>>>>>>>>>>>
>>>>>>>>>>>  able
>>>>>>>>>>
>>>>>>>>>
>>>>>>>  to avoid the issue by reducing
>>>>>>>>
>>>>>>>>>
>>>>>>>>>>>  mapred.job.shuffle.input.buffer.percent
>>>>>>>>>>
>>>>>>>>>
>>>>>>>  to 20%. Shouldn't the framework via
>>>>>>>>
>>>>>>>>> ShuffleRamManager.canFitInMemory
>>>>>>>>>>> and.ShuffleRamManager.reserve correctly detect the the memory
>>>>>>>>>>> available for allocation? I would think that with poor
>>>>>>>>>>> configuration
>>>>>>>>>>> settings (and default settings in particular) the job may not
>>>>>>>>>>> be as
>>>>>>>>>>> efficient, but wouldn't die.
>>>>>>>>>>>
>>>>>>>>>>> Here is some more context in the logs, I have attached the full
>>>>>>>>>>> reducer log here: http://gist.github.com/323746
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> 2010-03-06 07:54:49,621 INFO org.apache.hadoop.mapred.ReduceTask:
>>>>>>>>>>> Shuffling 4191933 bytes (435311 raw bytes) into RAM from
>>>>>>>>>>> attempt_201003060739_0002_m_000061_0
>>>>>>>>>>> 2010-03-06 07:54:50,222 INFO org.apache.hadoop.mapred.ReduceTask:
>>>>>>>>>>>
>>>>>>>>>>>  Task
>>>>>>>>>>
>>>>>>>>>
>>>>>>>  attempt_201003060739_0002_r_000000_0: Failed fetch #1 from
>>>>>>>>
>>>>>>>>> attempt_201003060739_0002_m_000202_0
>>>>>>>>>>> 2010-03-06 07:54:50,223 WARN org.apache.hadoop.mapred.ReduceTask:
>>>>>>>>>>> attempt_201003060739_0002_r_000000_0 adding host
>>>>>>>>>>> hd37.dfs.returnpath.net to penalty box, next contact in 4
>>>>>>>>>>> seconds
>>>>>>>>>>> 2010-03-06 07:54:50,223 INFO org.apache.hadoop.mapred.ReduceTask:
>>>>>>>>>>> attempt_201003060739_0002_r_000000_0: Got 1 map-outputs from
>>>>>>>>>>>
>>>>>>>>>>>  previous
>>>>>>>>>>
>>>>>>>>>
>>>>>>>  failures
>>>>>>>>
>>>>>>>>> 2010-03-06 07:54:50,223 FATAL
>>>>>>>>>>> org.apache.hadoop.mapred.TaskRunner:
>>>>>>>>>>> attempt_201003060739_0002_r_000000_0 : Map output copy failure :
>>>>>>>>>>> java.lang.OutOfMemoryError: Java heap space
>>>>>>>>>>>   at
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>  org.apache.hadoop.mapred.ReduceTask$ReduceCopier
>>>>>>>>
>>>>>>> $MapOutputCopier.shuffleInMemory(ReduceTask.java:1508)
>>>>>>>
>>>>>>>    at
>>>>>>>>
>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>  org.apache.hadoop.mapred.ReduceTask$ReduceCopier
>>>>>>>>
>>>>>>> $MapOutputCopier.getMapOutput(ReduceTask.java:1408)
>>>>>>>
>>>>>>>    at
>>>>>>>>
>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>  org.apache.hadoop.mapred.ReduceTask$ReduceCopier
>>>>>>>>
>>>>>>> $MapOutputCopier.copyOutput(ReduceTask.java:1261)
>>>>>>>
>>>>>>>    at
>>>>>>>>
>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run
>>>>>>>>
>>>>>>> (ReduceTask.java:1195)
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>>>>> We tried this both in 0.20.1 and 0.20.2. We had hoped
>>>>>>>>>>> MAPREDUCE-1182
>>>>>>>>>>> would address the issue in 0.20.2, but it did not. Does anyone
>>>>>>>>>>> have
>>>>>>>>>>> any comments or suggestions? Is this a bug I should file a JIRA
>>>>>>>>>>> for?
>>>>>>>>>>>
>>>>>>>>>>> Jacob Rideout
>>>>>>>>>>> Return Path
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>
>

Re: Shuffle In Memory OutOfMemoryError

Posted by Chris Douglas <ch...@yahoo-inc.com>.
I don't think this OOM is a framework bug per se, and given the  
rewrite/refactoring of the shuffle in MAPREDUCE-318 (in 0.21), tuning  
the 0.20 shuffle semantics is likely not worthwhile (though data  
informing improvements to trunk would be excellent). Most likely (and  
tautologically), ReduceTask simply requires more memory than is  
available and the job failure can be avoided by either 0) increasing  
the heap size or 1) lowering mapred.job.shuffle.input.buffer.percent. Most
of the tasks we run have a heap of 1GB. For a reduce fetching >200k  
map outputs, that's a reasonable, even stingy amount of space. -C
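
A minimal sketch of those two knobs on the 0.20-era mapred API (example values
only; the job class name is a placeholder and the right numbers depend on the
cluster and the job):

  import org.apache.hadoop.mapred.JobConf;

  // Example values only.
  JobConf conf = new JobConf(MyJob.class);          // MyJob is a placeholder
  conf.set("mapred.child.java.opts", "-Xmx1024m");  // 0) larger reduce task heap
  conf.setFloat("mapred.job.shuffle.input.buffer.percent", 0.50f); // 1) smaller shuffle buffer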

On Mar 10, 2010, at 5:26 AM, Ted Yu wrote:

> I verified that size and maxSize are long. This means MR-1182 didn't  
> resolve
> Andy's issue.
>
> According to Andy:
> At the beginning of the job there are 209,754 pending map tasks and 32
> pending reduce tasks
>
> My guess is that GC wasn't reclaiming memory fast enough, leading to  
> OOME
> because of large number of in-memory shuffle candidates.
>
> My suggestion for Andy would be to:
> 1. add -verbose:gc as JVM parameter
> 2. modify reserve() slightly to calculate the maximum outstanding
> numPendingRequests and print the maximum.
>
> Based on the output from above two items, we can discuss solution.
> My intuition is to place upperbound on numPendingRequests beyond which
> canFitInMemory() returns false.
> My two cents.
>
> On Tue, Mar 9, 2010 at 11:51 PM, Christopher Douglas
> <ch...@yahoo-inc.com>wrote:
>
>> That section of code is unmodified in MR-1182. See the patches/svn  
>> log. -C
>>
>> Sent from my iPhone
>>
>>
>> On Mar 9, 2010, at 7:44 PM, "Ted Yu" <yu...@gmail.com> wrote:
>>
>> I just downloaded hadoop-0.20.2 tar ball from cloudera mirror.
>>> This is what I see in ReduceTask (line 999):
>>>    public synchronized boolean reserve(int requestedSize,  
>>> InputStream in)
>>>
>>>    throws InterruptedException {
>>>      // Wait till the request can be fulfilled...
>>>      while ((size + requestedSize) > maxSize) {
>>>
>>> I don't see the fix from MR-1182.
>>>
>>> That's why I suggested to Andy that he manually apply MR-1182.
>>>
>>> Cheers
>>>
>>> On Tue, Mar 9, 2010 at 5:01 PM, Andy Sautins <andy.sautins@returnpath.net
>>>> wrote:
>>>
>>>
>>>> Thanks Christopher.
>>>>
>>>> The heap size for reduce tasks is configured to be 640M (
>>>> mapred.child.java.opts set to -Xmx640m ).
>>>>
>>>> Andy
>>>>
>>>> -----Original Message-----
>>>> From: Christopher Douglas [mailto:chrisdo@yahoo-inc.com]
>>>> Sent: Tuesday, March 09, 2010 5:19 PM
>>>> To: common-user@hadoop.apache.org
>>>> Subject: Re: Shuffle In Memory OutOfMemoryError
>>>>
>>>> No, MR-1182 is included in 0.20.2
>>>>
>>>> What heap size have you set for your reduce tasks? -C
>>>>
>>>> Sent from my iPhone
>>>>
>>>> On Mar 9, 2010, at 2:34 PM, "Ted Yu" <yu...@gmail.com> wrote:
>>>>
>>>> Andy:
>>>>> You need to manually apply the patch.
>>>>>
>>>>> Cheers
>>>>>
>>>>> On Tue, Mar 9, 2010 at 2:23 PM, Andy Sautins <
>>>>>
>>>> andy.sautins@returnpath.net
>>>>
>>>>> wrote:
>>>>>>
>>>>>
>>>>>
>>>>>> Thanks Ted.  My understanding is that MAPREDUCE-1182 is included
>>>>>> in the
>>>>>> 0.20.2 release.  We upgraded our cluster to 0.20.2 this weekend  
>>>>>> and
>>>>>> re-ran
>>>>>> the same job scenarios.  Running with  
>>>>>> mapred.reduce.parallel.copies
>>>>>> set to 1
>>>>>> and continue to have the same Java heap space error.
>>>>>>
>>>>>>
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Ted Yu [mailto:yuzhihong@gmail.com]
>>>>>> Sent: Tuesday, March 09, 2010 12:56 PM
>>>>>> To: common-user@hadoop.apache.org
>>>>>> Subject: Re: Shuffle In Memory OutOfMemoryError
>>>>>>
>>>>>> This issue has been resolved in
>>>>>> http://issues.apache.org/jira/browse/MAPREDUCE-1182
>>>>>>
>>>>>> Please apply the patch
>>>>>> M1182-1v20.patch<
>>>>>>
>>>>>>
>>>> http://issues.apache.org/jira/secure/attachment/12424116/M1182-1v20.patch
>>>>
>>>>>
>>>>>>>
>>>>>> On Sun, Mar 7, 2010 at 3:57 PM, Andy Sautins <
>>>>>>
>>>>> andy.sautins@returnpath.net
>>>>
>>>>> wrote:
>>>>>>>
>>>>>>
>>>>>>
>>>>>>> Thanks Ted.  Very helpful.  You are correct that I  
>>>>>>> misunderstood the
>>>>>>>
>>>>>> code
>>>>>>
>>>>>>> at ReduceTask.java:1535.  I missed the fact that it's in a
>>>>>>> IOException
>>>>>>>
>>>>>> catch
>>>>>>
>>>>>>> block.  My mistake.  That's what I get for being in a rush.
>>>>>>>
>>>>>>> For what it's worth I did re-run the job with
>>>>>>> mapred.reduce.parallel.copies set with values from 5 all the way
>>>>>>> down to
>>>>>>>
>>>>>> 1.
>>>>>>
>>>>>>> All failed with the same error:
>>>>>>>
>>>>>>> Error: java.lang.OutOfMemoryError: Java heap space
>>>>>>>    at
>>>>>>>
>>>>>>> org.apache.hadoop.mapred.ReduceTask$ReduceCopier
>>>>>> $MapOutputCopier.shuffleInMemory(ReduceTask.java:1508)
>>>>>>
>>>>>>>    at
>>>>>>>
>>>>>>> org.apache.hadoop.mapred.ReduceTask$ReduceCopier
>>>>>> $MapOutputCopier.getMapOutput(ReduceTask.java:1408)
>>>>>>
>>>>>>>    at
>>>>>>>
>>>>>>> org.apache.hadoop.mapred.ReduceTask$ReduceCopier
>>>>>> $MapOutputCopier.copyOutput(ReduceTask.java:1261)
>>>>>>
>>>>>>>    at
>>>>>>>
>>>>>>> org.apache.hadoop.mapred.ReduceTask$ReduceCopier 
>>>>>>> $MapOutputCopier.run
>>>>>> (ReduceTask.java:1195)
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> So from that it does seem like something else might be going on,
>>>>>>> yes?
>>>>>>>
>>>>>> I
>>>>>>
>>>>>>> need to do some more research.
>>>>>>>
>>>>>>> I appreciate your insights.
>>>>>>>
>>>>>>> Andy
>>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: Ted Yu [mailto:yuzhihong@gmail.com]
>>>>>>> Sent: Sunday, March 07, 2010 3:38 PM
>>>>>>> To: common-user@hadoop.apache.org
>>>>>>> Subject: Re: Shuffle In Memory OutOfMemoryError
>>>>>>>
>>>>>>> My observation is based on this call chain:
>>>>>>> MapOutputCopier.run() calling copyOutput() calling  
>>>>>>> getMapOutput()
>>>>>>> calling
>>>>>>> ramManager.canFitInMemory(decompressedLength)
>>>>>>>
>>>>>>> Basically ramManager.canFitInMemory() makes decision without
>>>>>>> considering
>>>>>>> the
>>>>>>> number of MapOutputCopiers that are running. Thus 1.25 * 0.7 of
>>>>>>> total
>>>>>>>
>>>>>> heap
>>>>>>
>>>>>>> may be used in shuffling if default parameters were used.
>>>>>>> Of course, you should check the value for
>>>>>>> mapred.reduce.parallel.copies
>>>>>>>
>>>>>> to
>>>>>>
>>>>>>> see if it is 5. If it is 4 or lower, my reasoning wouldn't  
>>>>>>> apply.
>>>>>>>
>>>>>>> About ramManager.unreserve() call, ReduceTask.java from hadoop
>>>>>>> 0.20.2
>>>>>>>
>>>>>> only
>>>>>>
>>>>>>> has 2731 lines. So I have to guess the location of the code
>>>>>>> snippet you
>>>>>>> provided.
>>>>>>> I found this around line 1535:
>>>>>>>    } catch (IOException ioe) {
>>>>>>>      LOG.info("Failed to shuffle from " +
>>>>>>> mapOutputLoc.getTaskAttemptId(),
>>>>>>>               ioe);
>>>>>>>
>>>>>>>      // Inform the ram-manager
>>>>>>>      ramManager.closeInMemoryFile(mapOutputLength);
>>>>>>>      ramManager.unreserve(mapOutputLength);
>>>>>>>
>>>>>>>      // Discard the map-output
>>>>>>>      try {
>>>>>>>        mapOutput.discard();
>>>>>>>      } catch (IOException ignored) {
>>>>>>>        LOG.info("Failed to discard map-output from " +
>>>>>>>                 mapOutputLoc.getTaskAttemptId(), ignored);
>>>>>>>      }
>>>>>>> Please confirm the line number.
>>>>>>>
>>>>>>> If we're looking at the same code, I am afraid I don't see how  
>>>>>>> we
>>>>>>> can
>>>>>>> improve it. First, I assume IOException shouldn't happen that  
>>>>>>> often.
>>>>>>> Second,
>>>>>>> mapOutput.discard() just sets:
>>>>>>>      data = null;
>>>>>>> for in memory case. Even if we call mapOutput.discard() before
>>>>>>> ramManager.unreserve(), we don't know when GC would kick in and
>>>>>>> make more
>>>>>>> memory available.
>>>>>>> Of course, given the large number of map outputs in your  
>>>>>>> system, it
>>>>>>>
>>>>>> became
>>>>>>
>>>>>>> more likely that the root cause from my reasoning made OOME  
>>>>>>> happen
>>>>>>>
>>>>>> sooner.
>>>>>>
>>>>>>>
>>>>>>> Thanks
>>>>>>>
>>>>>>>
>>>>>>>> On Sun, Mar 7, 2010 at 1:03 PM, Andy Sautins <
>>>>>>>
>>>>>> andy.sautins@returnpath.net
>>>>>>
>>>>>>> wrote:
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> Ted,
>>>>>>>>
>>>>>>>> I'm trying to follow the logic in your mail and I'm not sure  
>>>>>>>> I'm
>>>>>>>> following.  If you would mind helping me understand I would
>>>>>>>> appreciate
>>>>>>>>
>>>>>>> it.
>>>>>>>
>>>>>>>>
>>>>>>>> Looking at the code maxSingleShuffleLimit is only used in
>>>>>>>> determining
>>>>>>>>
>>>>>>> if
>>>>>>>
>>>>>>>> the copy _can_ fit into memory:
>>>>>>>>
>>>>>>>> boolean canFitInMemory(long requestedSize) {
>>>>>>>>    return (requestedSize < Integer.MAX_VALUE &&
>>>>>>>>            requestedSize < maxSingleShuffleLimit);
>>>>>>>>  }
>>>>>>>>
>>>>>>>> It also looks like the RamManager.reserve should wait until
>>>>>>>> memory
>>>>>>>>
>>>>>>> is
>>>>>>
>>>>>>> available so it should hit a memory limit for that reason.
>>>>>>>>
>>>>>>>> What does seem a little strange to me is the following (
>>>>>>>>
>>>>>>> ReduceTask.java
>>>>>>>
>>>>>>>> starting at 2730 ):
>>>>>>>>
>>>>>>>>      // Inform the ram-manager
>>>>>>>>      ramManager.closeInMemoryFile(mapOutputLength);
>>>>>>>>      ramManager.unreserve(mapOutputLength);
>>>>>>>>
>>>>>>>>      // Discard the map-output
>>>>>>>>      try {
>>>>>>>>        mapOutput.discard();
>>>>>>>>      } catch (IOException ignored) {
>>>>>>>>        LOG.info("Failed to discard map-output from " +
>>>>>>>>                 mapOutputLoc.getTaskAttemptId(), ignored);
>>>>>>>>      }
>>>>>>>>      mapOutput = null;
>>>>>>>>
>>>>>>>> So to me that looks like the ramManager unreserves the memory
>>>>>>>> before
>>>>>>>>
>>>>>>> the
>>>>>>>
>>>>>>>> mapOutput is discarded.  Shouldn't the mapOutput be discarded
>>>>>>>> _before_
>>>>>>>>
>>>>>>> the
>>>>>>>
>>>>>>>> ramManager unreserves the memory?  If the memory is unreserved
>>>>>>>> before
>>>>>>>>
>>>>>>> the
>>>>>>
>>>>>>> actual underlying data references are removed then it seems like
>>>>>>>>
>>>>>>> another
>>>>>>
>>>>>>> thread can try to allocate memory ( ReduceTask.java:2730 )  
>>>>>>> before
>>>>>>>> the
>>>>>>>> previous memory is disposed ( mapOutput.discard() ).
>>>>>>>>
>>>>>>>> Not sure that makes sense.  One thing to note is that the
>>>>>>>> particular
>>>>>>>>
>>>>>>> job
>>>>>>>
>>>>>>>> that is failing does have a good number ( 200k+ ) of map
>>>>>>>> outputs.  The
>>>>>>>>
>>>>>>> large
>>>>>>>
>>>>>>>> number of small map outputs may be why we are triggering a  
>>>>>>>> problem.
>>>>>>>>
>>>>>>>> Thanks again for your thoughts.
>>>>>>>>
>>>>>>>> Andy
>>>>>>>>
>>>>>>>>
>>>>>>>> -----Original Message-----
>>>>>>>> From: Jacob R Rideout [mailto:apache@jacobrideout.net]
>>>>>>>> Sent: Sunday, March 07, 2010 1:21 PM
>>>>>>>> To: common-user@hadoop.apache.org
>>>>>>>> Cc: Andy Sautins; Ted Yu
>>>>>>>> Subject: Re: Shuffle In Memory OutOfMemoryError
>>>>>>>>
>>>>>>>> Ted,
>>>>>>>>
>>>>>>>> Thank you. I filled MAPREDUCE-1571 to cover this issue. I might
>>>>>>>> have
>>>>>>>> some time to write a patch later this week.
>>>>>>>>
>>>>>>>> Jacob Rideout
>>>>>>>>
>>>>>>>> On Sat, Mar 6, 2010 at 11:37 PM, Ted Yu <yu...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> I think there is mismatch (in ReduceTask.java) between:
>>>>>>>>>  this.numCopiers =  
>>>>>>>>> conf.getInt("mapred.reduce.parallel.copies",
>>>>>>>>>
>>>>>>>> 5);
>>>>>>
>>>>>>> and:
>>>>>>>>>    maxSingleShuffleLimit = (long)(maxSize *
>>>>>>>>> MAX_SINGLE_SHUFFLE_SEGMENT_FRACTION);
>>>>>>>>> where MAX_SINGLE_SHUFFLE_SEGMENT_FRACTION is 0.25f
>>>>>>>>>
>>>>>>>>> because
>>>>>>>>>  copiers = new ArrayList<MapOutputCopier>(numCopiers);
>>>>>>>>> so the total memory allocated for in-mem shuffle is 1.25 *  
>>>>>>>>> maxSize
>>>>>>>>>
>>>>>>>>> A JIRA should be filed to correlate the constant 5 above and
>>>>>>>>> MAX_SINGLE_SHUFFLE_SEGMENT_FRACTION.
>>>>>>>>>
>>>>>>>>> Cheers
>>>>>>>>>
>>>>>>>>> On Sat, Mar 6, 2010 at 8:31 AM, Jacob R Rideout <
>>>>>>>>>
>>>>>>>> apache@jacobrideout.net
>>>>>>>
>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> Hi all,
>>>>>>>>>>
>>>>>>>>>> We are seeing the following error in our reducers of a  
>>>>>>>>>> particular
>>>>>>>>>>
>>>>>>>>> job:
>>>>>>
>>>>>>>
>>>>>>>>>> Error: java.lang.OutOfMemoryError: Java heap space
>>>>>>>>>>    at
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>
>>>>>>> org.apache.hadoop.mapred.ReduceTask$ReduceCopier
>>>>>> $MapOutputCopier.shuffleInMemory(ReduceTask.java:1508)
>>>>>>
>>>>>>>    at
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>
>>>>>>> org.apache.hadoop.mapred.ReduceTask$ReduceCopier
>>>>>> $MapOutputCopier.getMapOutput(ReduceTask.java:1408)
>>>>>>
>>>>>>>    at
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>
>>>>>>> org.apache.hadoop.mapred.ReduceTask$ReduceCopier
>>>>>> $MapOutputCopier.copyOutput(ReduceTask.java:1261)
>>>>>>
>>>>>>>    at
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>
>>>>>>> org.apache.hadoop.mapred.ReduceTask$ReduceCopier 
>>>>>>> $MapOutputCopier.run
>>>>>> (ReduceTask.java:1195)
>>>>>>
>>>>>>>
>>>>>>>>>>
>>>>>>>>>> After enough reducers fail the entire job fails. This error
>>>>>>>>>> occurs
>>>>>>>>>> regardless of whether mapred.compress.map.output is true.  
>>>>>>>>>> We were
>>>>>>>>>>
>>>>>>>>> able
>>>>>>
>>>>>>> to avoid the issue by reducing
>>>>>>>>>>
>>>>>>>>> mapred.job.shuffle.input.buffer.percent
>>>>>>
>>>>>>> to 20%. Shouldn't the framework via
>>>>>>>>>> ShuffleRamManager.canFitInMemory
>>>>>>>>>> and.ShuffleRamManager.reserve correctly detect the the memory
>>>>>>>>>> available for allocation? I would think that with poor
>>>>>>>>>> configuration
>>>>>>>>>> settings (and default settings in particular) the job may not
>>>>>>>>>> be as
>>>>>>>>>> efficient, but wouldn't die.
>>>>>>>>>>
>>>>>>>>>> Here is some more context in the logs, I have attached the  
>>>>>>>>>> full
>>>>>>>>>> reducer log here: http://gist.github.com/323746
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> 2010-03-06 07:54:49,621 INFO  
>>>>>>>>>> org.apache.hadoop.mapred.ReduceTask:
>>>>>>>>>> Shuffling 4191933 bytes (435311 raw bytes) into RAM from
>>>>>>>>>> attempt_201003060739_0002_m_000061_0
>>>>>>>>>> 2010-03-06 07:54:50,222 INFO  
>>>>>>>>>> org.apache.hadoop.mapred.ReduceTask:
>>>>>>>>>>
>>>>>>>>> Task
>>>>>>
>>>>>>> attempt_201003060739_0002_r_000000_0: Failed fetch #1 from
>>>>>>>>>> attempt_201003060739_0002_m_000202_0
>>>>>>>>>> 2010-03-06 07:54:50,223 WARN  
>>>>>>>>>> org.apache.hadoop.mapred.ReduceTask:
>>>>>>>>>> attempt_201003060739_0002_r_000000_0 adding host
>>>>>>>>>> hd37.dfs.returnpath.net to penalty box, next contact in 4  
>>>>>>>>>> seconds
>>>>>>>>>> 2010-03-06 07:54:50,223 INFO  
>>>>>>>>>> org.apache.hadoop.mapred.ReduceTask:
>>>>>>>>>> attempt_201003060739_0002_r_000000_0: Got 1 map-outputs from
>>>>>>>>>>
>>>>>>>>> previous
>>>>>>
>>>>>>> failures
>>>>>>>>>> 2010-03-06 07:54:50,223 FATAL
>>>>>>>>>> org.apache.hadoop.mapred.TaskRunner:
>>>>>>>>>> attempt_201003060739_0002_r_000000_0 : Map output copy  
>>>>>>>>>> failure :
>>>>>>>>>> java.lang.OutOfMemoryError: Java heap space
>>>>>>>>>>    at
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>
>>>>>>> org.apache.hadoop.mapred.ReduceTask$ReduceCopier
>>>>>> $MapOutputCopier.shuffleInMemory(ReduceTask.java:1508)
>>>>>>
>>>>>>>    at
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>
>>>>>>> org.apache.hadoop.mapred.ReduceTask$ReduceCopier
>>>>>> $MapOutputCopier.getMapOutput(ReduceTask.java:1408)
>>>>>>
>>>>>>>    at
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>
>>>>>>> org.apache.hadoop.mapred.ReduceTask$ReduceCopier
>>>>>> $MapOutputCopier.copyOutput(ReduceTask.java:1261)
>>>>>>
>>>>>>>    at
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>
>>>>>>> org.apache.hadoop.mapred.ReduceTask$ReduceCopier 
>>>>>>> $MapOutputCopier.run
>>>>>> (ReduceTask.java:1195)
>>>>>>
>>>>>>>
>>>>>>>>>>
>>>>>>>>>> We tried this both in 0.20.1 and 0.20.2. We had hoped
>>>>>>>>>> MAPREDUCE-1182
>>>>>>>>>> would address the issue in 0.20.2, but it did not. Does  
>>>>>>>>>> anyone
>>>>>>>>>> have
>>>>>>>>>> any comments or suggestions? Is this a bug I should file a  
>>>>>>>>>> JIRA
>>>>>>>>>> for?
>>>>>>>>>>
>>>>>>>>>> Jacob Rideout
>>>>>>>>>> Return Path
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>


Re: Shuffle In Memory OutOfMemoryError

Posted by Ted Yu <yu...@gmail.com>.
I verified that size and maxSize are long. This means MR-1182 didn't resolve
Andy's issue.

According to Andy:
At the beginning of the job there are 209,754 pending map tasks and 32
pending reduce tasks

My guess is that GC wasn't reclaiming memory fast enough, leading to OOME
because of the large number of in-memory shuffle candidates.

My suggestion for Andy would be to:
1. add -verbose:gc as a JVM parameter
2. modify reserve() slightly to calculate the maximum outstanding
numPendingRequests and print the maximum.

Based on the output from the above two items, we can discuss a solution.
My intuition is to place an upper bound on numPendingRequests beyond which
canFitInMemory() returns false.
My two cents.
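
A rough sketch of what item 2 could look like (illustrative only -- it assumes
a pending-request counter along the lines of the numPendingRequests value
mentioned above, and that size, maxSize and LOG are the surrounding ReduceTask
fields; this is not the actual patch):

  // Illustrative instrumentation sketch, not the real 0.20 code.
  private int numPendingRequests = 0;
  private int maxNumPendingRequests = 0;   // high-water mark to report

  public synchronized boolean reserve(int requestedSize, InputStream in)
      throws InterruptedException {
    numPendingRequests++;
    maxNumPendingRequests = Math.max(maxNumPendingRequests, numPendingRequests);
    LOG.info("reserve: numPendingRequests=" + numPendingRequests +
             " maxNumPendingRequests=" + maxNumPendingRequests +
             " size=" + size + " maxSize=" + maxSize);
    try {
      // Wait till the request can be fulfilled...
      while ((size + requestedSize) > maxSize) {
        wait();
      }
      size += requestedSize;
      return true;
    } finally {
      numPendingRequests--;
    }
  }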

On Tue, Mar 9, 2010 at 11:51 PM, Christopher Douglas
<ch...@yahoo-inc.com>wrote:

> That section of code is unmodified in MR-1182. See the patches/svn log. -C
>
> Sent from my iPhone
>
>
> On Mar 9, 2010, at 7:44 PM, "Ted Yu" <yu...@gmail.com> wrote:
>
>  I just downloaded hadoop-0.20.2 tar ball from cloudera mirror.
>> This is what I see in ReduceTask (line 999):
>>     public synchronized boolean reserve(int requestedSize, InputStream in)
>>
>>     throws InterruptedException {
>>       // Wait till the request can be fulfilled...
>>       while ((size + requestedSize) > maxSize) {
>>
>> I don't see the fix from MR-1182.
>>
>> That's why I suggested to Andy that he manually apply MR-1182.
>>
>> Cheers
>>
>> On Tue, Mar 9, 2010 at 5:01 PM, Andy Sautins <andy.sautins@returnpath.net
>> >wrote:
>>
>>
>>> Thanks Christopher.
>>>
>>> The heap size for reduce tasks is configured to be 640M (
>>> mapred.child.java.opts set to -Xmx640m ).
>>>
>>> Andy
>>>
>>> -----Original Message-----
>>> From: Christopher Douglas [mailto:chrisdo@yahoo-inc.com]
>>> Sent: Tuesday, March 09, 2010 5:19 PM
>>> To: common-user@hadoop.apache.org
>>> Subject: Re: Shuffle In Memory OutOfMemoryError
>>>
>>> No, MR-1182 is included in 0.20.2
>>>
>>> What heap size have you set for your reduce tasks? -C
>>>
>>> Sent from my iPhone
>>>
>>> On Mar 9, 2010, at 2:34 PM, "Ted Yu" <yu...@gmail.com> wrote:
>>>
>>>  Andy:
>>>> You need to manually apply the patch.
>>>>
>>>> Cheers
>>>>
>>>> On Tue, Mar 9, 2010 at 2:23 PM, Andy Sautins <
>>>>
>>> andy.sautins@returnpath.net
>>>
>>>> wrote:
>>>>>
>>>>
>>>>
>>>>> Thanks Ted.  My understanding is that MAPREDUCE-1182 is included
>>>>> in the
>>>>> 0.20.2 release.  We upgraded our cluster to 0.20.2 this weekend and
>>>>> re-ran
>>>>> the same job scenarios.  Running with mapred.reduce.parallel.copies
>>>>> set to 1
>>>>> and continue to have the same Java heap space error.
>>>>>
>>>>>
>>>>>
>>>>> -----Original Message-----
>>>>> From: Ted Yu [mailto:yuzhihong@gmail.com]
>>>>> Sent: Tuesday, March 09, 2010 12:56 PM
>>>>> To: common-user@hadoop.apache.org
>>>>> Subject: Re: Shuffle In Memory OutOfMemoryError
>>>>>
>>>>> This issue has been resolved in
>>>>> http://issues.apache.org/jira/browse/MAPREDUCE-1182
>>>>>
>>>>> Please apply the patch
>>>>> M1182-1v20.patch<
>>>>>
>>>>>
>>> http://issues.apache.org/jira/secure/attachment/12424116/M1182-1v20.patch
>>>
>>>>
>>>>>>
>>>>> On Sun, Mar 7, 2010 at 3:57 PM, Andy Sautins <
>>>>>
>>>> andy.sautins@returnpath.net
>>>
>>>> wrote:
>>>>>>
>>>>>
>>>>>
>>>>>> Thanks Ted.  Very helpful.  You are correct that I misunderstood the
>>>>>>
>>>>> code
>>>>>
>>>>>> at ReduceTask.java:1535.  I missed the fact that it's in a
>>>>>> IOException
>>>>>>
>>>>> catch
>>>>>
>>>>>> block.  My mistake.  That's what I get for being in a rush.
>>>>>>
>>>>>> For what it's worth I did re-run the job with
>>>>>> mapred.reduce.parallel.copies set with values from 5 all the way
>>>>>> down to
>>>>>>
>>>>> 1.
>>>>>
>>>>>> All failed with the same error:
>>>>>>
>>>>>> Error: java.lang.OutOfMemoryError: Java heap space
>>>>>>     at
>>>>>>
>>>>>>  org.apache.hadoop.mapred.ReduceTask$ReduceCopier
>>>>> $MapOutputCopier.shuffleInMemory(ReduceTask.java:1508)
>>>>>
>>>>>>     at
>>>>>>
>>>>>>  org.apache.hadoop.mapred.ReduceTask$ReduceCopier
>>>>> $MapOutputCopier.getMapOutput(ReduceTask.java:1408)
>>>>>
>>>>>>     at
>>>>>>
>>>>>>  org.apache.hadoop.mapred.ReduceTask$ReduceCopier
>>>>> $MapOutputCopier.copyOutput(ReduceTask.java:1261)
>>>>>
>>>>>>     at
>>>>>>
>>>>>>  org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run
>>>>> (ReduceTask.java:1195)
>>>>>
>>>>>>
>>>>>>
>>>>>> So from that it does seem like something else might be going on,
>>>>>> yes?
>>>>>>
>>>>> I
>>>>>
>>>>>> need to do some more research.
>>>>>>
>>>>>> I appreciate your insights.
>>>>>>
>>>>>> Andy
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Ted Yu [mailto:yuzhihong@gmail.com]
>>>>>> Sent: Sunday, March 07, 2010 3:38 PM
>>>>>> To: common-user@hadoop.apache.org
>>>>>> Subject: Re: Shuffle In Memory OutOfMemoryError
>>>>>>
>>>>>> My observation is based on this call chain:
>>>>>> MapOutputCopier.run() calling copyOutput() calling getMapOutput()
>>>>>> calling
>>>>>> ramManager.canFitInMemory(decompressedLength)
>>>>>>
>>>>>> Basically ramManager.canFitInMemory() makes decision without
>>>>>> considering
>>>>>> the
>>>>>> number of MapOutputCopiers that are running. Thus 1.25 * 0.7 of
>>>>>> total
>>>>>>
>>>>> heap
>>>>>
>>>>>> may be used in shuffling if default parameters were used.
>>>>>> Of course, you should check the value for
>>>>>> mapred.reduce.parallel.copies
>>>>>>
>>>>> to
>>>>>
>>>>>> see if it is 5. If it is 4 or lower, my reasoning wouldn't apply.
>>>>>>
>>>>>> About ramManager.unreserve() call, ReduceTask.java from hadoop
>>>>>> 0.20.2
>>>>>>
>>>>> only
>>>>>
>>>>>> has 2731 lines. So I have to guess the location of the code
>>>>>> snippet you
>>>>>> provided.
>>>>>> I found this around line 1535:
>>>>>>     } catch (IOException ioe) {
>>>>>>       LOG.info("Failed to shuffle from " +
>>>>>> mapOutputLoc.getTaskAttemptId(),
>>>>>>                ioe);
>>>>>>
>>>>>>       // Inform the ram-manager
>>>>>>       ramManager.closeInMemoryFile(mapOutputLength);
>>>>>>       ramManager.unreserve(mapOutputLength);
>>>>>>
>>>>>>       // Discard the map-output
>>>>>>       try {
>>>>>>         mapOutput.discard();
>>>>>>       } catch (IOException ignored) {
>>>>>>         LOG.info("Failed to discard map-output from " +
>>>>>>                  mapOutputLoc.getTaskAttemptId(), ignored);
>>>>>>       }
>>>>>> Please confirm the line number.
>>>>>>
>>>>>> If we're looking at the same code, I am afraid I don't see how we
>>>>>> can
>>>>>> improve it. First, I assume IOException shouldn't happen that often.
>>>>>> Second,
>>>>>> mapOutput.discard() just sets:
>>>>>>       data = null;
>>>>>> for in memory case. Even if we call mapOutput.discard() before
>>>>>> ramManager.unreserve(), we don't know when GC would kick in and
>>>>>> make more
>>>>>> memory available.
>>>>>> Of course, given the large number of map outputs in your system, it
>>>>>>
>>>>> became
>>>>>
>>>>>> more likely that the root cause from my reasoning made OOME happen
>>>>>>
>>>>> sooner.
>>>>>
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>>
>>>>>>>  On Sun, Mar 7, 2010 at 1:03 PM, Andy Sautins <
>>>>>>
>>>>> andy.sautins@returnpath.net
>>>>>
>>>>>> wrote:
>>>>>>>
>>>>>>
>>>>>>
>>>>>>> Ted,
>>>>>>>
>>>>>>> I'm trying to follow the logic in your mail and I'm not sure I'm
>>>>>>> following.  If you would mind helping me understand I would
>>>>>>> appreciate
>>>>>>>
>>>>>> it.
>>>>>>
>>>>>>>
>>>>>>> Looking at the code maxSingleShuffleLimit is only used in
>>>>>>> determining
>>>>>>>
>>>>>> if
>>>>>>
>>>>>>> the copy _can_ fit into memory:
>>>>>>>
>>>>>>>  boolean canFitInMemory(long requestedSize) {
>>>>>>>     return (requestedSize < Integer.MAX_VALUE &&
>>>>>>>             requestedSize < maxSingleShuffleLimit);
>>>>>>>   }
>>>>>>>
>>>>>>>  It also looks like the RamManager.reserve should wait until
>>>>>>> memory
>>>>>>>
>>>>>> is
>>>>>
>>>>>> available so it should hit a memory limit for that reason.
>>>>>>>
>>>>>>>  What does seem a little strange to me is the following (
>>>>>>>
>>>>>> ReduceTask.java
>>>>>>
>>>>>>> starting at 2730 ):
>>>>>>>
>>>>>>>       // Inform the ram-manager
>>>>>>>       ramManager.closeInMemoryFile(mapOutputLength);
>>>>>>>       ramManager.unreserve(mapOutputLength);
>>>>>>>
>>>>>>>       // Discard the map-output
>>>>>>>       try {
>>>>>>>         mapOutput.discard();
>>>>>>>       } catch (IOException ignored) {
>>>>>>>         LOG.info("Failed to discard map-output from " +
>>>>>>>                  mapOutputLoc.getTaskAttemptId(), ignored);
>>>>>>>       }
>>>>>>>       mapOutput = null;
>>>>>>>
>>>>>>> So to me that looks like the ramManager unreserves the memory
>>>>>>> before
>>>>>>>
>>>>>> the
>>>>>>
>>>>>>> mapOutput is discarded.  Shouldn't the mapOutput be discarded
>>>>>>> _before_
>>>>>>>
>>>>>> the
>>>>>>
>>>>>>> ramManager unreserves the memory?  If the memory is unreserved
>>>>>>> before
>>>>>>>
>>>>>> the
>>>>>
>>>>>> actual underlying data references are removed then it seems like
>>>>>>>
>>>>>> another
>>>>>
>>>>>> thread can try to allocate memory ( ReduceTask.java:2730 ) before
>>>>>>> the
>>>>>>> previous memory is disposed ( mapOutput.discard() ).
>>>>>>>
>>>>>>> Not sure that makes sense.  One thing to note is that the
>>>>>>> particular
>>>>>>>
>>>>>> job
>>>>>>
>>>>>>> that is failing does have a good number ( 200k+ ) of map
>>>>>>> outputs.  The
>>>>>>>
>>>>>> large
>>>>>>
>>>>>>> number of small map outputs may be why we are triggering a problem.
>>>>>>>
>>>>>>> Thanks again for your thoughts.
>>>>>>>
>>>>>>> Andy
>>>>>>>
>>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: Jacob R Rideout [mailto:apache@jacobrideout.net]
>>>>>>> Sent: Sunday, March 07, 2010 1:21 PM
>>>>>>> To: common-user@hadoop.apache.org
>>>>>>> Cc: Andy Sautins; Ted Yu
>>>>>>> Subject: Re: Shuffle In Memory OutOfMemoryError
>>>>>>>
>>>>>>> Ted,
>>>>>>>
>>>>>>> Thank you. I filled MAPREDUCE-1571 to cover this issue. I might
>>>>>>> have
>>>>>>> some time to write a patch later this week.
>>>>>>>
>>>>>>> Jacob Rideout
>>>>>>>
>>>>>>> On Sat, Mar 6, 2010 at 11:37 PM, Ted Yu <yu...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> I think there is mismatch (in ReduceTask.java) between:
>>>>>>>>   this.numCopiers = conf.getInt("mapred.reduce.parallel.copies",
>>>>>>>>
>>>>>>> 5);
>>>>>
>>>>>> and:
>>>>>>>>     maxSingleShuffleLimit = (long)(maxSize *
>>>>>>>> MAX_SINGLE_SHUFFLE_SEGMENT_FRACTION);
>>>>>>>> where MAX_SINGLE_SHUFFLE_SEGMENT_FRACTION is 0.25f
>>>>>>>>
>>>>>>>> because
>>>>>>>>   copiers = new ArrayList<MapOutputCopier>(numCopiers);
>>>>>>>> so the total memory allocated for in-mem shuffle is 1.25 * maxSize
>>>>>>>>
>>>>>>>> A JIRA should be filed to correlate the constant 5 above and
>>>>>>>> MAX_SINGLE_SHUFFLE_SEGMENT_FRACTION.
>>>>>>>>
>>>>>>>> Cheers
>>>>>>>>
>>>>>>>> On Sat, Mar 6, 2010 at 8:31 AM, Jacob R Rideout <
>>>>>>>>
>>>>>>> apache@jacobrideout.net
>>>>>>
>>>>>>> wrote:
>>>>>>>>
>>>>>>>>  Hi all,
>>>>>>>>>
>>>>>>>>> We are seeing the following error in our reducers of a particular
>>>>>>>>>
>>>>>>>> job:
>>>>>
>>>>>>
>>>>>>>>> Error: java.lang.OutOfMemoryError: Java heap space
>>>>>>>>>     at
>>>>>>>>>
>>>>>>>>>
>>>>>>>
>>>>>>  org.apache.hadoop.mapred.ReduceTask$ReduceCopier
>>>>> $MapOutputCopier.shuffleInMemory(ReduceTask.java:1508)
>>>>>
>>>>>>     at
>>>>>>>>>
>>>>>>>>>
>>>>>>>
>>>>>>  org.apache.hadoop.mapred.ReduceTask$ReduceCopier
>>>>> $MapOutputCopier.getMapOutput(ReduceTask.java:1408)
>>>>>
>>>>>>     at
>>>>>>>>>
>>>>>>>>>
>>>>>>>
>>>>>>  org.apache.hadoop.mapred.ReduceTask$ReduceCopier
>>>>> $MapOutputCopier.copyOutput(ReduceTask.java:1261)
>>>>>
>>>>>>     at
>>>>>>>>>
>>>>>>>>>
>>>>>>>
>>>>>>  org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run
>>>>> (ReduceTask.java:1195)
>>>>>
>>>>>>
>>>>>>>>>
>>>>>>>>> After enough reducers fail the entire job fails. This error
>>>>>>>>> occurs
>>>>>>>>> regardless of whether mapred.compress.map.output is true. We were
>>>>>>>>>
>>>>>>>> able
>>>>>
>>>>>> to avoid the issue by reducing
>>>>>>>>>
>>>>>>>> mapred.job.shuffle.input.buffer.percent
>>>>>
>>>>>> to 20%. Shouldn't the framework via
>>>>>>>>> ShuffleRamManager.canFitInMemory
>>>>>>>>> and.ShuffleRamManager.reserve correctly detect the the memory
>>>>>>>>> available for allocation? I would think that with poor
>>>>>>>>> configuration
>>>>>>>>> settings (and default settings in particular) the job may not
>>>>>>>>> be as
>>>>>>>>> efficient, but wouldn't die.
>>>>>>>>>
>>>>>>>>> Here is some more context in the logs, I have attached the full
>>>>>>>>> reducer log here: http://gist.github.com/323746
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> 2010-03-06 07:54:49,621 INFO org.apache.hadoop.mapred.ReduceTask:
>>>>>>>>> Shuffling 4191933 bytes (435311 raw bytes) into RAM from
>>>>>>>>> attempt_201003060739_0002_m_000061_0
>>>>>>>>> 2010-03-06 07:54:50,222 INFO org.apache.hadoop.mapred.ReduceTask:
>>>>>>>>>
>>>>>>>> Task
>>>>>
>>>>>> attempt_201003060739_0002_r_000000_0: Failed fetch #1 from
>>>>>>>>> attempt_201003060739_0002_m_000202_0
>>>>>>>>> 2010-03-06 07:54:50,223 WARN org.apache.hadoop.mapred.ReduceTask:
>>>>>>>>> attempt_201003060739_0002_r_000000_0 adding host
>>>>>>>>> hd37.dfs.returnpath.net to penalty box, next contact in 4 seconds
>>>>>>>>> 2010-03-06 07:54:50,223 INFO org.apache.hadoop.mapred.ReduceTask:
>>>>>>>>> attempt_201003060739_0002_r_000000_0: Got 1 map-outputs from
>>>>>>>>>
>>>>>>>> previous
>>>>>
>>>>>> failures
>>>>>>>>> 2010-03-06 07:54:50,223 FATAL
>>>>>>>>> org.apache.hadoop.mapred.TaskRunner:
>>>>>>>>> attempt_201003060739_0002_r_000000_0 : Map output copy failure :
>>>>>>>>> java.lang.OutOfMemoryError: Java heap space
>>>>>>>>>     at
>>>>>>>>>
>>>>>>>>>
>>>>>>>
>>>>>>  org.apache.hadoop.mapred.ReduceTask$ReduceCopier
>>>>> $MapOutputCopier.shuffleInMemory(ReduceTask.java:1508)
>>>>>
>>>>>>     at
>>>>>>>>>
>>>>>>>>>
>>>>>>>
>>>>>>  org.apache.hadoop.mapred.ReduceTask$ReduceCopier
>>>>> $MapOutputCopier.getMapOutput(ReduceTask.java:1408)
>>>>>
>>>>>>     at
>>>>>>>>>
>>>>>>>>>
>>>>>>>
>>>>>>  org.apache.hadoop.mapred.ReduceTask$ReduceCopier
>>>>> $MapOutputCopier.copyOutput(ReduceTask.java:1261)
>>>>>
>>>>>>     at
>>>>>>>>>
>>>>>>>>>
>>>>>>>
>>>>>>  org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run
>>>>> (ReduceTask.java:1195)
>>>>>
>>>>>>
>>>>>>>>>
>>>>>>>>> We tried this both in 0.20.1 and 0.20.2. We had hoped
>>>>>>>>> MAPREDUCE-1182
>>>>>>>>> would address the issue in 0.20.2, but it did not. Does anyone
>>>>>>>>> have
>>>>>>>>> any comments or suggestions? Is this a bug I should file a JIRA
>>>>>>>>> for?
>>>>>>>>>
>>>>>>>>> Jacob Rideout
>>>>>>>>> Return Path
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>

Re: Shuffle In Memory OutOfMemoryError

Posted by Christopher Douglas <ch...@yahoo-inc.com>.
That section of code is unmodified in MR-1182. See the patches/svn  
log. -C

Sent from my iPhone

On Mar 9, 2010, at 7:44 PM, "Ted Yu" <yu...@gmail.com> wrote:

> I just downloaded hadoop-0.20.2 tar ball from cloudera mirror.
> This is what I see in ReduceTask (line 999):
>      public synchronized boolean reserve(int requestedSize,  
> InputStream in)
>
>      throws InterruptedException {
>        // Wait till the request can be fulfilled...
>        while ((size + requestedSize) > maxSize) {
>
> I don't see the fix from MR-1182.
>
> That's why I suggested to Andy that he manually apply MR-1182.
>
> Cheers
>
> On Tue, Mar 9, 2010 at 5:01 PM, Andy Sautins <andy.sautins@returnpath.net 
> >wrote:
>
>>
>> Thanks Christopher.
>>
>> The heap size for reduce tasks is configured to be 640M (
>> mapred.child.java.opts set to -Xmx640m ).
>>
>> Andy
>>
>> -----Original Message-----
>> From: Christopher Douglas [mailto:chrisdo@yahoo-inc.com]
>> Sent: Tuesday, March 09, 2010 5:19 PM
>> To: common-user@hadoop.apache.org
>> Subject: Re: Shuffle In Memory OutOfMemoryError
>>
>> No, MR-1182 is included in 0.20.2
>>
>> What heap size have you set for your reduce tasks? -C
>>
>> Sent from my iPhone
>>
>> On Mar 9, 2010, at 2:34 PM, "Ted Yu" <yu...@gmail.com> wrote:
>>
>>> Andy:
>>> You need to manually apply the patch.
>>>
>>> Cheers
>>>
>>> On Tue, Mar 9, 2010 at 2:23 PM, Andy Sautins <
>> andy.sautins@returnpath.net
>>>> wrote:
>>>
>>>>
>>>> Thanks Ted.  My understanding is that MAPREDUCE-1182 is included
>>>> in the
>>>> 0.20.2 release.  We upgraded our cluster to 0.20.2 this weekend and
>>>> re-ran
>>>> the same job scenarios.  Running with mapred.reduce.parallel.copies
>>>> set to 1
>>>> and continue to have the same Java heap space error.
>>>>
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: Ted Yu [mailto:yuzhihong@gmail.com]
>>>> Sent: Tuesday, March 09, 2010 12:56 PM
>>>> To: common-user@hadoop.apache.org
>>>> Subject: Re: Shuffle In Memory OutOfMemoryError
>>>>
>>>> This issue has been resolved in
>>>> http://issues.apache.org/jira/browse/MAPREDUCE-1182
>>>>
>>>> Please apply the patch
>>>> M1182-1v20.patch<
>>>>
>> http://issues.apache.org/jira/secure/attachment/12424116/M1182-1v20.patch
>>>>>
>>>>
>>>> On Sun, Mar 7, 2010 at 3:57 PM, Andy Sautins <
>> andy.sautins@returnpath.net
>>>>> wrote:
>>>>
>>>>>
>>>>> Thanks Ted.  Very helpful.  You are correct that I misunderstood  
>>>>> the
>>>> code
>>>>> at ReduceTask.java:1535.  I missed the fact that it's in a
>>>>> IOException
>>>> catch
>>>>> block.  My mistake.  That's what I get for being in a rush.
>>>>>
>>>>> For what it's worth I did re-run the job with
>>>>> mapred.reduce.parallel.copies set with values from 5 all the way
>>>>> down to
>>>> 1.
>>>>> All failed with the same error:
>>>>>
>>>>> Error: java.lang.OutOfMemoryError: Java heap space
>>>>>      at
>>>>>
>>>> org.apache.hadoop.mapred.ReduceTask$ReduceCopier
>>>> $MapOutputCopier.shuffleInMemory(ReduceTask.java:1508)
>>>>>      at
>>>>>
>>>> org.apache.hadoop.mapred.ReduceTask$ReduceCopier
>>>> $MapOutputCopier.getMapOutput(ReduceTask.java:1408)
>>>>>      at
>>>>>
>>>> org.apache.hadoop.mapred.ReduceTask$ReduceCopier
>>>> $MapOutputCopier.copyOutput(ReduceTask.java:1261)
>>>>>      at
>>>>>
>>>> org.apache.hadoop.mapred.ReduceTask$ReduceCopier 
>>>> $MapOutputCopier.run
>>>> (ReduceTask.java:1195)
>>>>>
>>>>>
>>>>> So from that it does seem like something else might be going on,
>>>>> yes?
>>>> I
>>>>> need to do some more research.
>>>>>
>>>>> I appreciate your insights.
>>>>>
>>>>> Andy
>>>>>
>>>>> -----Original Message-----
>>>>> From: Ted Yu [mailto:yuzhihong@gmail.com]
>>>>> Sent: Sunday, March 07, 2010 3:38 PM
>>>>> To: common-user@hadoop.apache.org
>>>>> Subject: Re: Shuffle In Memory OutOfMemoryError
>>>>>
>>>>> My observation is based on this call chain:
>>>>> MapOutputCopier.run() calling copyOutput() calling getMapOutput()
>>>>> calling
>>>>> ramManager.canFitInMemory(decompressedLength)
>>>>>
>>>>> Basically ramManager.canFitInMemory() makes decision without
>>>>> considering
>>>>> the
>>>>> number of MapOutputCopiers that are running. Thus 1.25 * 0.7 of
>>>>> total
>>>> heap
>>>>> may be used in shuffling if default parameters were used.
>>>>> Of course, you should check the value for
>>>>> mapred.reduce.parallel.copies
>>>> to
>>>>> see if it is 5. If it is 4 or lower, my reasoning wouldn't apply.
>>>>>
>>>>> About ramManager.unreserve() call, ReduceTask.java from hadoop
>>>>> 0.20.2
>>>> only
>>>>> has 2731 lines. So I have to guess the location of the code
>>>>> snippet you
>>>>> provided.
>>>>> I found this around line 1535:
>>>>>      } catch (IOException ioe) {
>>>>>        LOG.info("Failed to shuffle from " +
>>>>> mapOutputLoc.getTaskAttemptId(),
>>>>>                 ioe);
>>>>>
>>>>>        // Inform the ram-manager
>>>>>        ramManager.closeInMemoryFile(mapOutputLength);
>>>>>        ramManager.unreserve(mapOutputLength);
>>>>>
>>>>>        // Discard the map-output
>>>>>        try {
>>>>>          mapOutput.discard();
>>>>>        } catch (IOException ignored) {
>>>>>          LOG.info("Failed to discard map-output from " +
>>>>>                   mapOutputLoc.getTaskAttemptId(), ignored);
>>>>>        }
>>>>> Please confirm the line number.
>>>>>
>>>>> If we're looking at the same code, I am afraid I don't see how we
>>>>> can
>>>>> improve it. First, I assume IOException shouldn't happen that  
>>>>> often.
>>>>> Second,
>>>>> mapOutput.discard() just sets:
>>>>>        data = null;
>>>>> for in memory case. Even if we call mapOutput.discard() before
>>>>> ramManager.unreserve(), we don't know when GC would kick in and
>>>>> make more
>>>>> memory available.
>>>>> Of course, given the large number of map outputs in your system,  
>>>>> it
>>>> became
>>>>> more likely that the root cause from my reasoning made OOME happen
>>>> sooner.
>>>>>
>>>>> Thanks
>>>>>
>>>>>>
>>>>> On Sun, Mar 7, 2010 at 1:03 PM, Andy Sautins <
>>>> andy.sautins@returnpath.net
>>>>>> wrote:
>>>>>
>>>>>>
>>>>>> Ted,
>>>>>>
>>>>>> I'm trying to follow the logic in your mail and I'm not sure I'm
>>>>>> following.  If you would mind helping me understand I would
>>>>>> appreciate
>>>>> it.
>>>>>>
>>>>>> Looking at the code maxSingleShuffleLimit is only used in
>>>>>> determining
>>>>> if
>>>>>> the copy _can_ fit into memory:
>>>>>>
>>>>>>   boolean canFitInMemory(long requestedSize) {
>>>>>>      return (requestedSize < Integer.MAX_VALUE &&
>>>>>>              requestedSize < maxSingleShuffleLimit);
>>>>>>    }
>>>>>>
>>>>>>  It also looks like the RamManager.reserve should wait until
>>>>>> memory
>>>> is
>>>>>> available so it should hit a memory limit for that reason.
>>>>>>
>>>>>>  What does seem a little strange to me is the following (
>>>>> ReduceTask.java
>>>>>> starting at 2730 ):
>>>>>>
>>>>>>        // Inform the ram-manager
>>>>>>        ramManager.closeInMemoryFile(mapOutputLength);
>>>>>>        ramManager.unreserve(mapOutputLength);
>>>>>>
>>>>>>        // Discard the map-output
>>>>>>        try {
>>>>>>          mapOutput.discard();
>>>>>>        } catch (IOException ignored) {
>>>>>>          LOG.info("Failed to discard map-output from " +
>>>>>>                   mapOutputLoc.getTaskAttemptId(), ignored);
>>>>>>        }
>>>>>>        mapOutput = null;
>>>>>>
>>>>>> So to me that looks like the ramManager unreserves the memory
>>>>>> before
>>>>> the
>>>>>> mapOutput is discarded.  Shouldn't the mapOutput be discarded
>>>>>> _before_
>>>>> the
>>>>>> ramManager unreserves the memory?  If the memory is unreserved
>>>>>> before
>>>> the
>>>>>> actual underlying data references are removed then it seems like
>>>> another
>>>>>> thread can try to allocate memory ( ReduceTask.java:2730 ) before
>>>>>> the
>>>>>> previous memory is disposed ( mapOutput.discard() ).
>>>>>>
>>>>>> Not sure that makes sense.  One thing to note is that the
>>>>>> particular
>>>>> job
>>>>>> that is failing does have a good number ( 200k+ ) of map
>>>>>> outputs.  The
>>>>> large
>>>>>> number of small map outputs may be why we are triggering a  
>>>>>> problem.
>>>>>>
>>>>>> Thanks again for your thoughts.
>>>>>>
>>>>>> Andy
>>>>>>
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Jacob R Rideout [mailto:apache@jacobrideout.net]
>>>>>> Sent: Sunday, March 07, 2010 1:21 PM
>>>>>> To: common-user@hadoop.apache.org
>>>>>> Cc: Andy Sautins; Ted Yu
>>>>>> Subject: Re: Shuffle In Memory OutOfMemoryError
>>>>>>
>>>>>> Ted,
>>>>>>
>>>>>> Thank you. I filled MAPREDUCE-1571 to cover this issue. I might
>>>>>> have
>>>>>> some time to write a patch later this week.
>>>>>>
>>>>>> Jacob Rideout
>>>>>>
>>>>>> On Sat, Mar 6, 2010 at 11:37 PM, Ted Yu <yu...@gmail.com>
>>>>>> wrote:
>>>>>>> I think there is mismatch (in ReduceTask.java) between:
>>>>>>>    this.numCopiers = conf.getInt 
>>>>>>> ("mapred.reduce.parallel.copies",
>>>> 5);
>>>>>>> and:
>>>>>>>      maxSingleShuffleLimit = (long)(maxSize *
>>>>>>> MAX_SINGLE_SHUFFLE_SEGMENT_FRACTION);
>>>>>>> where MAX_SINGLE_SHUFFLE_SEGMENT_FRACTION is 0.25f
>>>>>>>
>>>>>>> because
>>>>>>>    copiers = new ArrayList<MapOutputCopier>(numCopiers);
>>>>>>> so the total memory allocated for in-mem shuffle is 1.25 *  
>>>>>>> maxSize
>>>>>>>
>>>>>>> A JIRA should be filed to correlate the constant 5 above and
>>>>>>> MAX_SINGLE_SHUFFLE_SEGMENT_FRACTION.
>>>>>>>
>>>>>>> Cheers
>>>>>>>
>>>>>>> On Sat, Mar 6, 2010 at 8:31 AM, Jacob R Rideout <
>>>>> apache@jacobrideout.net
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi all,
>>>>>>>>
>>>>>>>> We are seeing the following error in our reducers of a  
>>>>>>>> particular
>>>> job:
>>>>>>>>
>>>>>>>> Error: java.lang.OutOfMemoryError: Java heap space
>>>>>>>>      at
>>>>>>>>
>>>>>>
>>>>>
>>>> org.apache.hadoop.mapred.ReduceTask$ReduceCopier
>>>> $MapOutputCopier.shuffleInMemory(ReduceTask.java:1508)
>>>>>>>>      at
>>>>>>>>
>>>>>>
>>>>>
>>>> org.apache.hadoop.mapred.ReduceTask$ReduceCopier
>>>> $MapOutputCopier.getMapOutput(ReduceTask.java:1408)
>>>>>>>>      at
>>>>>>>>
>>>>>>
>>>>>
>>>> org.apache.hadoop.mapred.ReduceTask$ReduceCopier
>>>> $MapOutputCopier.copyOutput(ReduceTask.java:1261)
>>>>>>>>      at
>>>>>>>>
>>>>>>
>>>>>
>>>> org.apache.hadoop.mapred.ReduceTask$ReduceCopier 
>>>> $MapOutputCopier.run
>>>> (ReduceTask.java:1195)
>>>>>>>>
>>>>>>>>
>>>>>>>> After enough reducers fail the entire job fails. This error
>>>>>>>> occurs
>>>>>>>> regardless of whether mapred.compress.map.output is true. We  
>>>>>>>> were
>>>> able
>>>>>>>> to avoid the issue by reducing
>>>> mapred.job.shuffle.input.buffer.percent
>>>>>>>> to 20%. Shouldn't the framework via
>>>>>>>> ShuffleRamManager.canFitInMemory
>>>>>>>> and.ShuffleRamManager.reserve correctly detect the the memory
>>>>>>>> available for allocation? I would think that with poor
>>>>>>>> configuration
>>>>>>>> settings (and default settings in particular) the job may not
>>>>>>>> be as
>>>>>>>> efficient, but wouldn't die.
>>>>>>>>
>>>>>>>> Here is some more context in the logs, I have attached the full
>>>>>>>> reducer log here: http://gist.github.com/323746
>>>>>>>>
>>>>>>>>
>>>>>>>> 2010-03-06 07:54:49,621 INFO  
>>>>>>>> org.apache.hadoop.mapred.ReduceTask:
>>>>>>>> Shuffling 4191933 bytes (435311 raw bytes) into RAM from
>>>>>>>> attempt_201003060739_0002_m_000061_0
>>>>>>>> 2010-03-06 07:54:50,222 INFO  
>>>>>>>> org.apache.hadoop.mapred.ReduceTask:
>>>> Task
>>>>>>>> attempt_201003060739_0002_r_000000_0: Failed fetch #1 from
>>>>>>>> attempt_201003060739_0002_m_000202_0
>>>>>>>> 2010-03-06 07:54:50,223 WARN  
>>>>>>>> org.apache.hadoop.mapred.ReduceTask:
>>>>>>>> attempt_201003060739_0002_r_000000_0 adding host
>>>>>>>> hd37.dfs.returnpath.net to penalty box, next contact in 4  
>>>>>>>> seconds
>>>>>>>> 2010-03-06 07:54:50,223 INFO  
>>>>>>>> org.apache.hadoop.mapred.ReduceTask:
>>>>>>>> attempt_201003060739_0002_r_000000_0: Got 1 map-outputs from
>>>> previous
>>>>>>>> failures
>>>>>>>> 2010-03-06 07:54:50,223 FATAL
>>>>>>>> org.apache.hadoop.mapred.TaskRunner:
>>>>>>>> attempt_201003060739_0002_r_000000_0 : Map output copy  
>>>>>>>> failure :
>>>>>>>> java.lang.OutOfMemoryError: Java heap space
>>>>>>>>      at
>>>>>>>>
>>>>>>
>>>>>
>>>> org.apache.hadoop.mapred.ReduceTask$ReduceCopier
>>>> $MapOutputCopier.shuffleInMemory(ReduceTask.java:1508)
>>>>>>>>      at
>>>>>>>>
>>>>>>
>>>>>
>>>> org.apache.hadoop.mapred.ReduceTask$ReduceCopier
>>>> $MapOutputCopier.getMapOutput(ReduceTask.java:1408)
>>>>>>>>      at
>>>>>>>>
>>>>>>
>>>>>
>>>> org.apache.hadoop.mapred.ReduceTask$ReduceCopier
>>>> $MapOutputCopier.copyOutput(ReduceTask.java:1261)
>>>>>>>>      at
>>>>>>>>
>>>>>>
>>>>>
>>>> org.apache.hadoop.mapred.ReduceTask$ReduceCopier 
>>>> $MapOutputCopier.run
>>>> (ReduceTask.java:1195)
>>>>>>>>
>>>>>>>>
>>>>>>>> We tried this both in 0.20.1 and 0.20.2. We had hoped
>>>>>>>> MAPREDUCE-1182
>>>>>>>> would address the issue in 0.20.2, but it did not. Does anyone
>>>>>>>> have
>>>>>>>> any comments or suggestions? Is this a bug I should file a JIRA
>>>>>>>> for?
>>>>>>>>
>>>>>>>> Jacob Rideout
>>>>>>>> Return Path
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>

Re: Shuffle In Memory OutOfMemoryError

Posted by Ted Yu <yu...@gmail.com>.
I just downloaded the hadoop-0.20.2 tarball from the Cloudera mirror.
This is what I see in ReduceTask (line 999):
      public synchronized boolean reserve(int requestedSize, InputStream in)
          throws InterruptedException {
        // Wait till the request can be fulfilled...
        while ((size + requestedSize) > maxSize) {

I don't see the fix from MR-1182.

That's why I suggested to Andy that he manually apply MR-1182.
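
For reference, here is a condensed, illustrative paraphrase of the
bookkeeping that reserve()/unreserve() perform (heavily simplified -- the
class below is made up for illustration and is not the Hadoop source):

  public class SimpleRamManager {
    private final long maxSize;  // roughly heap * mapred.job.shuffle.input.buffer.percent
    private long size = 0;       // bytes currently reserved for in-memory map outputs

    public SimpleRamManager(long maxSize) {
      this.maxSize = maxSize;
    }

    public synchronized void reserve(long requestedSize) throws InterruptedException {
      // Block the copier until its request fits under the configured ceiling.
      while (size + requestedSize > maxSize) {
        wait();
      }
      size += requestedSize;
    }

    public synchronized void unreserve(long requestedSize) {
      size -= requestedSize;
      notifyAll();  // wake copiers blocked in reserve()
    }
  }

Note that the ceiling only tracks bytes that have gone through reserve();
anything a copier still references after calling unreserve() (for example
between unreserve() and discard(), as discussed earlier in the thread) is
outside this accounting.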

Cheers

On Tue, Mar 9, 2010 at 5:01 PM, Andy Sautins <an...@returnpath.net>wrote:

>
>  Thanks Christopher.
>
>  The heap size for reduce tasks is configured to be 640M (
> mapred.child.java.opts set to -Xmx640m ).
>
>  Andy
>
> -----Original Message-----
> From: Christopher Douglas [mailto:chrisdo@yahoo-inc.com]
> Sent: Tuesday, March 09, 2010 5:19 PM
> To: common-user@hadoop.apache.org
> Subject: Re: Shuffle In Memory OutOfMemoryError
>
> No, MR-1182 is included in 0.20.2
>
> What heap size have you set for your reduce tasks? -C
>
> Sent from my iPhone
>
> On Mar 9, 2010, at 2:34 PM, "Ted Yu" <yu...@gmail.com> wrote:
>
> > Andy:
> > You need to manually apply the patch.
> >
> > Cheers
> >
> > On Tue, Mar 9, 2010 at 2:23 PM, Andy Sautins <
> andy.sautins@returnpath.net
> > >wrote:
> >
> >>
> >>  Thanks Ted.  My understanding is that MAPREDUCE-1182 is included
> >> in the
> >> 0.20.2 release.  We upgraded our cluster to 0.20.2 this weekend and
> >> re-ran
> >> the same job scenarios.  Running with mapred.reduce.parallel.copies
> >> set to 1
> >> and continue to have the same Java heap space error.
> >>
> >>
> >>
> >> -----Original Message-----
> >> From: Ted Yu [mailto:yuzhihong@gmail.com]
> >> Sent: Tuesday, March 09, 2010 12:56 PM
> >> To: common-user@hadoop.apache.org
> >> Subject: Re: Shuffle In Memory OutOfMemoryError
> >>
> >> This issue has been resolved in
> >> http://issues.apache.org/jira/browse/MAPREDUCE-1182
> >>
> >> Please apply the patch
> >> M1182-1v20.patch<
> >>
> http://issues.apache.org/jira/secure/attachment/12424116/M1182-1v20.patch
> >> >
> >>
> >> On Sun, Mar 7, 2010 at 3:57 PM, Andy Sautins <
> andy.sautins@returnpath.net
> >>> wrote:
> >>
> >>>
> >>> Thanks Ted.  Very helpful.  You are correct that I misunderstood the
> >> code
> >>> at ReduceTask.java:1535.  I missed the fact that it's in a
> >>> IOException
> >> catch
> >>> block.  My mistake.  That's what I get for being in a rush.
> >>>
> >>> For what it's worth I did re-run the job with
> >>> mapred.reduce.parallel.copies set with values from 5 all the way
> >>> down to
> >> 1.
> >>> All failed with the same error:
> >>>
> >>> Error: java.lang.OutOfMemoryError: Java heap space
> >>>       at
> >>>
> >> org.apache.hadoop.mapred.ReduceTask$ReduceCopier
> >> $MapOutputCopier.shuffleInMemory(ReduceTask.java:1508)
> >>>       at
> >>>
> >> org.apache.hadoop.mapred.ReduceTask$ReduceCopier
> >> $MapOutputCopier.getMapOutput(ReduceTask.java:1408)
> >>>       at
> >>>
> >> org.apache.hadoop.mapred.ReduceTask$ReduceCopier
> >> $MapOutputCopier.copyOutput(ReduceTask.java:1261)
> >>>       at
> >>>
> >> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run
> >> (ReduceTask.java:1195)
> >>>
> >>>
> >>>  So from that it does seem like something else might be going on,
> >>> yes?
> >> I
> >>> need to do some more research.
> >>>
> >>> I appreciate your insights.
> >>>
> >>> Andy
> >>>
> >>> -----Original Message-----
> >>> From: Ted Yu [mailto:yuzhihong@gmail.com]
> >>> Sent: Sunday, March 07, 2010 3:38 PM
> >>> To: common-user@hadoop.apache.org
> >>> Subject: Re: Shuffle In Memory OutOfMemoryError
> >>>
> >>> My observation is based on this call chain:
> >>> MapOutputCopier.run() calling copyOutput() calling getMapOutput()
> >>> calling
> >>> ramManager.canFitInMemory(decompressedLength)
> >>>
> >>> Basically ramManager.canFitInMemory() makes decision without
> >>> considering
> >>> the
> >>> number of MapOutputCopiers that are running. Thus 1.25 * 0.7 of
> >>> total
> >> heap
> >>> may be used in shuffling if default parameters were used.
> >>> Of course, you should check the value for
> >>> mapred.reduce.parallel.copies
> >> to
> >>> see if it is 5. If it is 4 or lower, my reasoning wouldn't apply.
> >>>
> >>> About ramManager.unreserve() call, ReduceTask.java from hadoop
> >>> 0.20.2
> >> only
> >>> has 2731 lines. So I have to guess the location of the code
> >>> snippet you
> >>> provided.
> >>> I found this around line 1535:
> >>>       } catch (IOException ioe) {
> >>>         LOG.info("Failed to shuffle from " +
> >>> mapOutputLoc.getTaskAttemptId(),
> >>>                  ioe);
> >>>
> >>>         // Inform the ram-manager
> >>>         ramManager.closeInMemoryFile(mapOutputLength);
> >>>         ramManager.unreserve(mapOutputLength);
> >>>
> >>>         // Discard the map-output
> >>>         try {
> >>>           mapOutput.discard();
> >>>         } catch (IOException ignored) {
> >>>           LOG.info("Failed to discard map-output from " +
> >>>                    mapOutputLoc.getTaskAttemptId(), ignored);
> >>>         }
> >>> Please confirm the line number.
> >>>
> >>> If we're looking at the same code, I am afraid I don't see how we
> >>> can
> >>> improve it. First, I assume IOException shouldn't happen that often.
> >>> Second,
> >>> mapOutput.discard() just sets:
> >>>         data = null;
> >>> for in memory case. Even if we call mapOutput.discard() before
> >>> ramManager.unreserve(), we don't know when GC would kick in and
> >>> make more
> >>> memory available.
> >>> Of course, given the large number of map outputs in your system, it
> >> became
> >>> more likely that the root cause from my reasoning made OOME happen
> >> sooner.
> >>>
> >>> Thanks
> >>>
> >>>>
> >>> On Sun, Mar 7, 2010 at 1:03 PM, Andy Sautins <
> >> andy.sautins@returnpath.net
> >>>> wrote:
> >>>
> >>>>
> >>>>  Ted,
> >>>>
> >>>>  I'm trying to follow the logic in your mail and I'm not sure I'm
> >>>> following.  If you would mind helping me understand I would
> >>>> appreciate
> >>> it.
> >>>>
> >>>>  Looking at the code maxSingleShuffleLimit is only used in
> >>>> determining
> >>> if
> >>>> the copy _can_ fit into memory:
> >>>>
> >>>>    boolean canFitInMemory(long requestedSize) {
> >>>>       return (requestedSize < Integer.MAX_VALUE &&
> >>>>               requestedSize < maxSingleShuffleLimit);
> >>>>     }
> >>>>
> >>>>   It also looks like the RamManager.reserve should wait until
> >>>> memory
> >> is
> >>>> available so it should hit a memory limit for that reason.
> >>>>
> >>>>   What does seem a little strange to me is the following (
> >>> ReduceTask.java
> >>>> starting at 2730 ):
> >>>>
> >>>>         // Inform the ram-manager
> >>>>         ramManager.closeInMemoryFile(mapOutputLength);
> >>>>         ramManager.unreserve(mapOutputLength);
> >>>>
> >>>>         // Discard the map-output
> >>>>         try {
> >>>>           mapOutput.discard();
> >>>>         } catch (IOException ignored) {
> >>>>           LOG.info("Failed to discard map-output from " +
> >>>>                    mapOutputLoc.getTaskAttemptId(), ignored);
> >>>>         }
> >>>>         mapOutput = null;
> >>>>
> >>>>  So to me that looks like the ramManager unreserves the memory
> >>>> before
> >>> the
> >>>> mapOutput is discarded.  Shouldn't the mapOutput be discarded
> >>>> _before_
> >>> the
> >>>> ramManager unreserves the memory?  If the memory is unreserved
> >>>> before
> >> the
> >>>> actual underlying data references are removed then it seems like
> >> another
> >>>> thread can try to allocate memory ( ReduceTask.java:2730 ) before
> >>>> the
> >>>> previous memory is disposed ( mapOutput.discard() ).
> >>>>
> >>>>  Not sure that makes sense.  One thing to note is that the
> >>>> particular
> >>> job
> >>>> that is failing does have a good number ( 200k+ ) of map
> >>>> outputs.  The
> >>> large
> >>>> number of small map outputs may be why we are triggering a problem.
> >>>>
> >>>>  Thanks again for your thoughts.
> >>>>
> >>>>  Andy
> >>>>
> >>>>
> >>>> -----Original Message-----
> >>>> From: Jacob R Rideout [mailto:apache@jacobrideout.net]
> >>>> Sent: Sunday, March 07, 2010 1:21 PM
> >>>> To: common-user@hadoop.apache.org
> >>>> Cc: Andy Sautins; Ted Yu
> >>>> Subject: Re: Shuffle In Memory OutOfMemoryError
> >>>>
> >>>> Ted,
> >>>>
> >>>> Thank you. I filled MAPREDUCE-1571 to cover this issue. I might
> >>>> have
> >>>> some time to write a patch later this week.
> >>>>
> >>>> Jacob Rideout
> >>>>
> >>>> On Sat, Mar 6, 2010 at 11:37 PM, Ted Yu <yu...@gmail.com>
> >>>> wrote:
> >>>>> I think there is mismatch (in ReduceTask.java) between:
> >>>>>     this.numCopiers = conf.getInt("mapred.reduce.parallel.copies",
> >> 5);
> >>>>> and:
> >>>>>       maxSingleShuffleLimit = (long)(maxSize *
> >>>>> MAX_SINGLE_SHUFFLE_SEGMENT_FRACTION);
> >>>>> where MAX_SINGLE_SHUFFLE_SEGMENT_FRACTION is 0.25f
> >>>>>
> >>>>> because
> >>>>>     copiers = new ArrayList<MapOutputCopier>(numCopiers);
> >>>>> so the total memory allocated for in-mem shuffle is 1.25 * maxSize
> >>>>>
> >>>>> A JIRA should be filed to correlate the constant 5 above and
> >>>>> MAX_SINGLE_SHUFFLE_SEGMENT_FRACTION.
> >>>>>
> >>>>> Cheers
> >>>>>
> >>>>> On Sat, Mar 6, 2010 at 8:31 AM, Jacob R Rideout <
> >>> apache@jacobrideout.net
> >>>>> wrote:
> >>>>>
> >>>>>> Hi all,
> >>>>>>
> >>>>>> We are seeing the following error in our reducers of a particular
> >> job:
> >>>>>>
> >>>>>> Error: java.lang.OutOfMemoryError: Java heap space
> >>>>>>       at
> >>>>>>
> >>>>
> >>>
> >> org.apache.hadoop.mapred.ReduceTask$ReduceCopier
> >> $MapOutputCopier.shuffleInMemory(ReduceTask.java:1508)
> >>>>>>       at
> >>>>>>
> >>>>
> >>>
> >> org.apache.hadoop.mapred.ReduceTask$ReduceCopier
> >> $MapOutputCopier.getMapOutput(ReduceTask.java:1408)
> >>>>>>       at
> >>>>>>
> >>>>
> >>>
> >> org.apache.hadoop.mapred.ReduceTask$ReduceCopier
> >> $MapOutputCopier.copyOutput(ReduceTask.java:1261)
> >>>>>>       at
> >>>>>>
> >>>>
> >>>
> >> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run
> >> (ReduceTask.java:1195)
> >>>>>>
> >>>>>>
> >>>>>> After enough reducers fail the entire job fails. This error
> >>>>>> occurs
> >>>>>> regardless of whether mapred.compress.map.output is true. We were
> >> able
> >>>>>> to avoid the issue by reducing
> >> mapred.job.shuffle.input.buffer.percent
> >>>>>> to 20%. Shouldn't the framework via
> >>>>>> ShuffleRamManager.canFitInMemory
> >>>>>> and.ShuffleRamManager.reserve correctly detect the the memory
> >>>>>> available for allocation? I would think that with poor
> >>>>>> configuration
> >>>>>> settings (and default settings in particular) the job may not
> >>>>>> be as
> >>>>>> efficient, but wouldn't die.
> >>>>>>
> >>>>>> Here is some more context in the logs, I have attached the full
> >>>>>> reducer log here: http://gist.github.com/323746
> >>>>>>
> >>>>>>
> >>>>>> 2010-03-06 07:54:49,621 INFO org.apache.hadoop.mapred.ReduceTask:
> >>>>>> Shuffling 4191933 bytes (435311 raw bytes) into RAM from
> >>>>>> attempt_201003060739_0002_m_000061_0
> >>>>>> 2010-03-06 07:54:50,222 INFO org.apache.hadoop.mapred.ReduceTask:
> >> Task
> >>>>>> attempt_201003060739_0002_r_000000_0: Failed fetch #1 from
> >>>>>> attempt_201003060739_0002_m_000202_0
> >>>>>> 2010-03-06 07:54:50,223 WARN org.apache.hadoop.mapred.ReduceTask:
> >>>>>> attempt_201003060739_0002_r_000000_0 adding host
> >>>>>> hd37.dfs.returnpath.net to penalty box, next contact in 4 seconds
> >>>>>> 2010-03-06 07:54:50,223 INFO org.apache.hadoop.mapred.ReduceTask:
> >>>>>> attempt_201003060739_0002_r_000000_0: Got 1 map-outputs from
> >> previous
> >>>>>> failures
> >>>>>> 2010-03-06 07:54:50,223 FATAL
> >>>>>> org.apache.hadoop.mapred.TaskRunner:
> >>>>>> attempt_201003060739_0002_r_000000_0 : Map output copy failure :
> >>>>>> java.lang.OutOfMemoryError: Java heap space
> >>>>>>       at
> >>>>>>
> >>>>
> >>>
> >> org.apache.hadoop.mapred.ReduceTask$ReduceCopier
> >> $MapOutputCopier.shuffleInMemory(ReduceTask.java:1508)
> >>>>>>       at
> >>>>>>
> >>>>
> >>>
> >> org.apache.hadoop.mapred.ReduceTask$ReduceCopier
> >> $MapOutputCopier.getMapOutput(ReduceTask.java:1408)
> >>>>>>       at
> >>>>>>
> >>>>
> >>>
> >> org.apache.hadoop.mapred.ReduceTask$ReduceCopier
> >> $MapOutputCopier.copyOutput(ReduceTask.java:1261)
> >>>>>>       at
> >>>>>>
> >>>>
> >>>
> >> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run
> >> (ReduceTask.java:1195)
> >>>>>>
> >>>>>>
> >>>>>> We tried this both in 0.20.1 and 0.20.2. We had hoped
> >>>>>> MAPREDUCE-1182
> >>>>>> would address the issue in 0.20.2, but it did not. Does anyone
> >>>>>> have
> >>>>>> any comments or suggestions? Is this a bug I should file a JIRA
> >>>>>> for?
> >>>>>>
> >>>>>> Jacob Rideout
> >>>>>> Return Path
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
>

RE: Shuffle In Memory OutOfMemoryError

Posted by Andy Sautins <an...@returnpath.net>.
  Thanks Christopher.  

  The heap size for reduce tasks is configured to be 640M ( mapred.child.java.opts set to -Xmx640m ).
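
To put that 640M figure against the default fractions quoted earlier in the
thread (0.7 for mapred.job.shuffle.input.buffer.percent, 0.25f for
MAX_SINGLE_SHUFFLE_SEGMENT_FRACTION, 5 for mapred.reduce.parallel.copies),
here is a rough back-of-the-envelope sketch -- the class name and the
arithmetic are purely illustrative:

  public class ShuffleBudget {
    public static void main(String[] args) {
      long heap = 640L << 20;               // -Xmx640m
      double inputBufferPercent = 0.70;     // mapred.job.shuffle.input.buffer.percent (default)
      double singleSegmentFraction = 0.25;  // MAX_SINGLE_SHUFFLE_SEGMENT_FRACTION
      int parallelCopies = 5;               // mapred.reduce.parallel.copies (default)

      long maxSize = (long) (heap * inputBufferPercent);                     // ~448 MB
      long maxSingleShuffleLimit = (long) (maxSize * singleSegmentFraction); // ~112 MB

      System.out.printf("shuffle buffer ceiling (maxSize): %d MB%n", maxSize >> 20);
      System.out.printf("per-segment limit:                %d MB%n", maxSingleShuffleLimit >> 20);
      // Worst case under the 1.25 * maxSize reasoning earlier in the thread:
      System.out.printf("%d copiers at the limit:          %d MB of a %d MB heap%n",
          parallelCopies, (parallelCopies * maxSingleShuffleLimit) >> 20, heap >> 20);
    }
  }

That leaves very little headroom for the copier threads, merge buffers and
everything else in the reduce JVM, which is at least consistent with the
OOMEs reported in this thread.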

  Andy

-----Original Message-----
From: Christopher Douglas [mailto:chrisdo@yahoo-inc.com] 
Sent: Tuesday, March 09, 2010 5:19 PM
To: common-user@hadoop.apache.org
Subject: Re: Shuffle In Memory OutOfMemoryError

No, MR-1182 is included in 0.20.2

What heap size have you set for your reduce tasks? -C

Sent from my iPhone

On Mar 9, 2010, at 2:34 PM, "Ted Yu" <yu...@gmail.com> wrote:

> Andy:
> You need to manually apply the patch.
>
> Cheers
>
> On Tue, Mar 9, 2010 at 2:23 PM, Andy Sautins <andy.sautins@returnpath.net 
> >wrote:
>
>>
>>  Thanks Ted.  My understanding is that MAPREDUCE-1182 is included  
>> in the
>> 0.20.2 release.  We upgraded our cluster to 0.20.2 this weekend and  
>> re-ran
>> the same job scenarios.  Running with mapred.reduce.parallel.copies  
>> set to 1
>> and continue to have the same Java heap space error.
>>
>>
>>
>> -----Original Message-----
>> From: Ted Yu [mailto:yuzhihong@gmail.com]
>> Sent: Tuesday, March 09, 2010 12:56 PM
>> To: common-user@hadoop.apache.org
>> Subject: Re: Shuffle In Memory OutOfMemoryError
>>
>> This issue has been resolved in
>> http://issues.apache.org/jira/browse/MAPREDUCE-1182
>>
>> Please apply the patch
>> M1182-1v20.patch<
>> http://issues.apache.org/jira/secure/attachment/12424116/M1182-1v20.patch 
>> >
>>
>> On Sun, Mar 7, 2010 at 3:57 PM, Andy Sautins <andy.sautins@returnpath.net
>>> wrote:
>>
>>>
>>> Thanks Ted.  Very helpful.  You are correct that I misunderstood the
>> code
>>> at ReduceTask.java:1535.  I missed the fact that it's in a  
>>> IOException
>> catch
>>> block.  My mistake.  That's what I get for being in a rush.
>>>
>>> For what it's worth I did re-run the job with
>>> mapred.reduce.parallel.copies set with values from 5 all the way  
>>> down to
>> 1.
>>> All failed with the same error:
>>>
>>> Error: java.lang.OutOfMemoryError: Java heap space
>>>       at
>>>
>> org.apache.hadoop.mapred.ReduceTask$ReduceCopier 
>> $MapOutputCopier.shuffleInMemory(ReduceTask.java:1508)
>>>       at
>>>
>> org.apache.hadoop.mapred.ReduceTask$ReduceCopier 
>> $MapOutputCopier.getMapOutput(ReduceTask.java:1408)
>>>       at
>>>
>> org.apache.hadoop.mapred.ReduceTask$ReduceCopier 
>> $MapOutputCopier.copyOutput(ReduceTask.java:1261)
>>>       at
>>>
>> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run 
>> (ReduceTask.java:1195)
>>>
>>>
>>>  So from that it does seem like something else might be going on,  
>>> yes?
>> I
>>> need to do some more research.
>>>
>>> I appreciate your insights.
>>>
>>> Andy
>>>
>>> -----Original Message-----
>>> From: Ted Yu [mailto:yuzhihong@gmail.com]
>>> Sent: Sunday, March 07, 2010 3:38 PM
>>> To: common-user@hadoop.apache.org
>>> Subject: Re: Shuffle In Memory OutOfMemoryError
>>>
>>> My observation is based on this call chain:
>>> MapOutputCopier.run() calling copyOutput() calling getMapOutput()  
>>> calling
>>> ramManager.canFitInMemory(decompressedLength)
>>>
>>> Basically ramManager.canFitInMemory() makes decision without  
>>> considering
>>> the
>>> number of MapOutputCopiers that are running. Thus 1.25 * 0.7 of  
>>> total
>> heap
>>> may be used in shuffling if default parameters were used.
>>> Of course, you should check the value for  
>>> mapred.reduce.parallel.copies
>> to
>>> see if it is 5. If it is 4 or lower, my reasoning wouldn't apply.
>>>
>>> About ramManager.unreserve() call, ReduceTask.java from hadoop  
>>> 0.20.2
>> only
>>> has 2731 lines. So I have to guess the location of the code  
>>> snippet you
>>> provided.
>>> I found this around line 1535:
>>>       } catch (IOException ioe) {
>>>         LOG.info("Failed to shuffle from " +
>>> mapOutputLoc.getTaskAttemptId(),
>>>                  ioe);
>>>
>>>         // Inform the ram-manager
>>>         ramManager.closeInMemoryFile(mapOutputLength);
>>>         ramManager.unreserve(mapOutputLength);
>>>
>>>         // Discard the map-output
>>>         try {
>>>           mapOutput.discard();
>>>         } catch (IOException ignored) {
>>>           LOG.info("Failed to discard map-output from " +
>>>                    mapOutputLoc.getTaskAttemptId(), ignored);
>>>         }
>>> Please confirm the line number.
>>>
>>> If we're looking at the same code, I am afraid I don't see how we  
>>> can
>>> improve it. First, I assume IOException shouldn't happen that often.
>>> Second,
>>> mapOutput.discard() just sets:
>>>         data = null;
>>> for in memory case. Even if we call mapOutput.discard() before
>>> ramManager.unreserve(), we don't know when GC would kick in and  
>>> make more
>>> memory available.
>>> Of course, given the large number of map outputs in your system, it
>> became
>>> more likely that the root cause from my reasoning made OOME happen
>> sooner.
>>>
>>> Thanks
>>>
>>>>
>>> On Sun, Mar 7, 2010 at 1:03 PM, Andy Sautins <
>> andy.sautins@returnpath.net
>>>> wrote:
>>>
>>>>
>>>>  Ted,
>>>>
>>>>  I'm trying to follow the logic in your mail and I'm not sure I'm
>>>> following.  If you would mind helping me understand I would  
>>>> appreciate
>>> it.
>>>>
>>>>  Looking at the code maxSingleShuffleLimit is only used in  
>>>> determining
>>> if
>>>> the copy _can_ fit into memory:
>>>>
>>>>    boolean canFitInMemory(long requestedSize) {
>>>>       return (requestedSize < Integer.MAX_VALUE &&
>>>>               requestedSize < maxSingleShuffleLimit);
>>>>     }
>>>>
>>>>   It also looks like the RamManager.reserve should wait until  
>>>> memory
>> is
>>>> available so it should hit a memory limit for that reason.
>>>>
>>>>   What does seem a little strange to me is the following (
>>> ReduceTask.java
>>>> starting at 2730 ):
>>>>
>>>>         // Inform the ram-manager
>>>>         ramManager.closeInMemoryFile(mapOutputLength);
>>>>         ramManager.unreserve(mapOutputLength);
>>>>
>>>>         // Discard the map-output
>>>>         try {
>>>>           mapOutput.discard();
>>>>         } catch (IOException ignored) {
>>>>           LOG.info("Failed to discard map-output from " +
>>>>                    mapOutputLoc.getTaskAttemptId(), ignored);
>>>>         }
>>>>         mapOutput = null;
>>>>
>>>>  So to me that looks like the ramManager unreserves the memory  
>>>> before
>>> the
>>>> mapOutput is discarded.  Shouldn't the mapOutput be discarded  
>>>> _before_
>>> the
>>>> ramManager unreserves the memory?  If the memory is unreserved  
>>>> before
>> the
>>>> actual underlying data references are removed then it seems like
>> another
>>>> thread can try to allocate memory ( ReduceTask.java:2730 ) before  
>>>> the
>>>> previous memory is disposed ( mapOutput.discard() ).
>>>>
>>>>  Not sure that makes sense.  One thing to note is that the  
>>>> particular
>>> job
>>>> that is failing does have a good number ( 200k+ ) of map  
>>>> outputs.  The
>>> large
>>>> number of small map outputs may be why we are triggering a problem.
>>>>
>>>>  Thanks again for your thoughts.
>>>>
>>>>  Andy
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: Jacob R Rideout [mailto:apache@jacobrideout.net]
>>>> Sent: Sunday, March 07, 2010 1:21 PM
>>>> To: common-user@hadoop.apache.org
>>>> Cc: Andy Sautins; Ted Yu
>>>> Subject: Re: Shuffle In Memory OutOfMemoryError
>>>>
>>>> Ted,
>>>>
>>>> Thank you. I filled MAPREDUCE-1571 to cover this issue. I might  
>>>> have
>>>> some time to write a patch later this week.
>>>>
>>>> Jacob Rideout
>>>>
>>>> On Sat, Mar 6, 2010 at 11:37 PM, Ted Yu <yu...@gmail.com>  
>>>> wrote:
>>>>> I think there is mismatch (in ReduceTask.java) between:
>>>>>     this.numCopiers = conf.getInt("mapred.reduce.parallel.copies",
>> 5);
>>>>> and:
>>>>>       maxSingleShuffleLimit = (long)(maxSize *
>>>>> MAX_SINGLE_SHUFFLE_SEGMENT_FRACTION);
>>>>> where MAX_SINGLE_SHUFFLE_SEGMENT_FRACTION is 0.25f
>>>>>
>>>>> because
>>>>>     copiers = new ArrayList<MapOutputCopier>(numCopiers);
>>>>> so the total memory allocated for in-mem shuffle is 1.25 * maxSize
>>>>>
>>>>> A JIRA should be filed to correlate the constant 5 above and
>>>>> MAX_SINGLE_SHUFFLE_SEGMENT_FRACTION.
>>>>>
>>>>> Cheers
>>>>>
>>>>> On Sat, Mar 6, 2010 at 8:31 AM, Jacob R Rideout <
>>> apache@jacobrideout.net
>>>>> wrote:
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> We are seeing the following error in our reducers of a particular
>> job:
>>>>>>
>>>>>> Error: java.lang.OutOfMemoryError: Java heap space
>>>>>>       at
>>>>>>
>>>>
>>>
>> org.apache.hadoop.mapred.ReduceTask$ReduceCopier 
>> $MapOutputCopier.shuffleInMemory(ReduceTask.java:1508)
>>>>>>       at
>>>>>>
>>>>
>>>
>> org.apache.hadoop.mapred.ReduceTask$ReduceCopier 
>> $MapOutputCopier.getMapOutput(ReduceTask.java:1408)
>>>>>>       at
>>>>>>
>>>>
>>>
>> org.apache.hadoop.mapred.ReduceTask$ReduceCopier 
>> $MapOutputCopier.copyOutput(ReduceTask.java:1261)
>>>>>>       at
>>>>>>
>>>>
>>>
>> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run 
>> (ReduceTask.java:1195)
>>>>>>
>>>>>>
>>>>>> After enough reducers fail the entire job fails. This error  
>>>>>> occurs
>>>>>> regardless of whether mapred.compress.map.output is true. We were
>> able
>>>>>> to avoid the issue by reducing
>> mapred.job.shuffle.input.buffer.percent
>>>>>> to 20%. Shouldn't the framework via  
>>>>>> ShuffleRamManager.canFitInMemory
>>>>>> and.ShuffleRamManager.reserve correctly detect the the memory
>>>>>> available for allocation? I would think that with poor  
>>>>>> configuration
>>>>>> settings (and default settings in particular) the job may not  
>>>>>> be as
>>>>>> efficient, but wouldn't die.
>>>>>>
>>>>>> Here is some more context in the logs, I have attached the full
>>>>>> reducer log here: http://gist.github.com/323746
>>>>>>
>>>>>>
>>>>>> 2010-03-06 07:54:49,621 INFO org.apache.hadoop.mapred.ReduceTask:
>>>>>> Shuffling 4191933 bytes (435311 raw bytes) into RAM from
>>>>>> attempt_201003060739_0002_m_000061_0
>>>>>> 2010-03-06 07:54:50,222 INFO org.apache.hadoop.mapred.ReduceTask:
>> Task
>>>>>> attempt_201003060739_0002_r_000000_0: Failed fetch #1 from
>>>>>> attempt_201003060739_0002_m_000202_0
>>>>>> 2010-03-06 07:54:50,223 WARN org.apache.hadoop.mapred.ReduceTask:
>>>>>> attempt_201003060739_0002_r_000000_0 adding host
>>>>>> hd37.dfs.returnpath.net to penalty box, next contact in 4 seconds
>>>>>> 2010-03-06 07:54:50,223 INFO org.apache.hadoop.mapred.ReduceTask:
>>>>>> attempt_201003060739_0002_r_000000_0: Got 1 map-outputs from
>> previous
>>>>>> failures
>>>>>> 2010-03-06 07:54:50,223 FATAL  
>>>>>> org.apache.hadoop.mapred.TaskRunner:
>>>>>> attempt_201003060739_0002_r_000000_0 : Map output copy failure :
>>>>>> java.lang.OutOfMemoryError: Java heap space
>>>>>>       at
>>>>>>
>>>>
>>>
>> org.apache.hadoop.mapred.ReduceTask$ReduceCopier 
>> $MapOutputCopier.shuffleInMemory(ReduceTask.java:1508)
>>>>>>       at
>>>>>>
>>>>
>>>
>> org.apache.hadoop.mapred.ReduceTask$ReduceCopier 
>> $MapOutputCopier.getMapOutput(ReduceTask.java:1408)
>>>>>>       at
>>>>>>
>>>>
>>>
>> org.apache.hadoop.mapred.ReduceTask$ReduceCopier 
>> $MapOutputCopier.copyOutput(ReduceTask.java:1261)
>>>>>>       at
>>>>>>
>>>>
>>>
>> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run 
>> (ReduceTask.java:1195)
>>>>>>
>>>>>>
>>>>>> We tried this both in 0.20.1 and 0.20.2. We had hoped  
>>>>>> MAPREDUCE-1182
>>>>>> would address the issue in 0.20.2, but it did not. Does anyone  
>>>>>> have
>>>>>> any comments or suggestions? Is this a bug I should file a JIRA  
>>>>>> for?
>>>>>>
>>>>>> Jacob Rideout
>>>>>> Return Path
>>>>>>
>>>>>
>>>>
>>>
>>

Re: Shuffle In Memory OutOfMemoryError

Posted by Christopher Douglas <ch...@yahoo-inc.com>.
No, MR-1182 is included in 0.20.2

What heap size have you set for your reduce tasks? -C

Sent from my iPhone

On Mar 9, 2010, at 2:34 PM, "Ted Yu" <yu...@gmail.com> wrote:

> Andy:
> You need to manually apply the patch.
>
> Cheers
>
> On Tue, Mar 9, 2010 at 2:23 PM, Andy Sautins <andy.sautins@returnpath.net 
> >wrote:
>
>>
>>  Thanks Ted.  My understanding is that MAPREDUCE-1182 is included  
>> in the
>> 0.20.2 release.  We upgraded our cluster to 0.20.2 this weekend and  
>> re-ran
>> the same job scenarios.  Running with mapred.reduce.parallel.copies  
>> set to 1
>> and continue to have the same Java heap space error.
>>
>>
>>
>> -----Original Message-----
>> From: Ted Yu [mailto:yuzhihong@gmail.com]
>> Sent: Tuesday, March 09, 2010 12:56 PM
>> To: common-user@hadoop.apache.org
>> Subject: Re: Shuffle In Memory OutOfMemoryError
>>
>> This issue has been resolved in
>> http://issues.apache.org/jira/browse/MAPREDUCE-1182
>>
>> Please apply the patch
>> M1182-1v20.patch<
>> http://issues.apache.org/jira/secure/attachment/12424116/M1182-1v20.patch 
>> >
>>
>> On Sun, Mar 7, 2010 at 3:57 PM, Andy Sautins <andy.sautins@returnpath.net
>>> wrote:
>>
>>>
>>> Thanks Ted.  Very helpful.  You are correct that I misunderstood the
>> code
>>> at ReduceTask.java:1535.  I missed the fact that it's in a  
>>> IOException
>> catch
>>> block.  My mistake.  That's what I get for being in a rush.
>>>
>>> For what it's worth I did re-run the job with
>>> mapred.reduce.parallel.copies set with values from 5 all the way  
>>> down to
>> 1.
>>> All failed with the same error:
>>>
>>> Error: java.lang.OutOfMemoryError: Java heap space
>>>       at
>>>
>> org.apache.hadoop.mapred.ReduceTask$ReduceCopier 
>> $MapOutputCopier.shuffleInMemory(ReduceTask.java:1508)
>>>       at
>>>
>> org.apache.hadoop.mapred.ReduceTask$ReduceCopier 
>> $MapOutputCopier.getMapOutput(ReduceTask.java:1408)
>>>       at
>>>
>> org.apache.hadoop.mapred.ReduceTask$ReduceCopier 
>> $MapOutputCopier.copyOutput(ReduceTask.java:1261)
>>>       at
>>>
>> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run 
>> (ReduceTask.java:1195)
>>>
>>>
>>>  So from that it does seem like something else might be going on,  
>>> yes?
>> I
>>> need to do some more research.
>>>
>>> I appreciate your insights.
>>>
>>> Andy
>>>
>>> -----Original Message-----
>>> From: Ted Yu [mailto:yuzhihong@gmail.com]
>>> Sent: Sunday, March 07, 2010 3:38 PM
>>> To: common-user@hadoop.apache.org
>>> Subject: Re: Shuffle In Memory OutOfMemoryError
>>>
>>> My observation is based on this call chain:
>>> MapOutputCopier.run() calling copyOutput() calling getMapOutput()  
>>> calling
>>> ramManager.canFitInMemory(decompressedLength)
>>>
>>> Basically ramManager.canFitInMemory() makes decision without  
>>> considering
>>> the
>>> number of MapOutputCopiers that are running. Thus 1.25 * 0.7 of  
>>> total
>> heap
>>> may be used in shuffling if default parameters were used.
>>> Of course, you should check the value for  
>>> mapred.reduce.parallel.copies
>> to
>>> see if it is 5. If it is 4 or lower, my reasoning wouldn't apply.
>>>
>>> About ramManager.unreserve() call, ReduceTask.java from hadoop  
>>> 0.20.2
>> only
>>> has 2731 lines. So I have to guess the location of the code  
>>> snippet you
>>> provided.
>>> I found this around line 1535:
>>>       } catch (IOException ioe) {
>>>         LOG.info("Failed to shuffle from " +
>>> mapOutputLoc.getTaskAttemptId(),
>>>                  ioe);
>>>
>>>         // Inform the ram-manager
>>>         ramManager.closeInMemoryFile(mapOutputLength);
>>>         ramManager.unreserve(mapOutputLength);
>>>
>>>         // Discard the map-output
>>>         try {
>>>           mapOutput.discard();
>>>         } catch (IOException ignored) {
>>>           LOG.info("Failed to discard map-output from " +
>>>                    mapOutputLoc.getTaskAttemptId(), ignored);
>>>         }
>>> Please confirm the line number.
>>>
>>> If we're looking at the same code, I am afraid I don't see how we  
>>> can
>>> improve it. First, I assume IOException shouldn't happen that often.
>>> Second,
>>> mapOutput.discard() just sets:
>>>         data = null;
>>> for in memory case. Even if we call mapOutput.discard() before
>>> ramManager.unreserve(), we don't know when GC would kick in and  
>>> make more
>>> memory available.
>>> Of course, given the large number of map outputs in your system, it
>> became
>>> more likely that the root cause from my reasoning made OOME happen
>> sooner.
>>>
>>> Thanks
>>>
>>>>
>>> On Sun, Mar 7, 2010 at 1:03 PM, Andy Sautins <
>> andy.sautins@returnpath.net
>>>> wrote:
>>>
>>>>
>>>>  Ted,
>>>>
>>>>  I'm trying to follow the logic in your mail and I'm not sure I'm
>>>> following.  If you would mind helping me understand I would  
>>>> appreciate
>>> it.
>>>>
>>>>  Looking at the code maxSingleShuffleLimit is only used in  
>>>> determining
>>> if
>>>> the copy _can_ fit into memory:
>>>>
>>>>    boolean canFitInMemory(long requestedSize) {
>>>>       return (requestedSize < Integer.MAX_VALUE &&
>>>>               requestedSize < maxSingleShuffleLimit);
>>>>     }
>>>>
>>>>   It also looks like the RamManager.reserve should wait until  
>>>> memory
>> is
>>>> available so it should hit a memory limit for that reason.
>>>>
>>>>   What does seem a little strange to me is the following (
>>> ReduceTask.java
>>>> starting at 2730 ):
>>>>
>>>>         // Inform the ram-manager
>>>>         ramManager.closeInMemoryFile(mapOutputLength);
>>>>         ramManager.unreserve(mapOutputLength);
>>>>
>>>>         // Discard the map-output
>>>>         try {
>>>>           mapOutput.discard();
>>>>         } catch (IOException ignored) {
>>>>           LOG.info("Failed to discard map-output from " +
>>>>                    mapOutputLoc.getTaskAttemptId(), ignored);
>>>>         }
>>>>         mapOutput = null;
>>>>
>>>>  So to me that looks like the ramManager unreserves the memory  
>>>> before
>>> the
>>>> mapOutput is discarded.  Shouldn't the mapOutput be discarded  
>>>> _before_
>>> the
>>>> ramManager unreserves the memory?  If the memory is unreserved  
>>>> before
>> the
>>>> actual underlying data references are removed then it seems like
>> another
>>>> thread can try to allocate memory ( ReduceTask.java:2730 ) before  
>>>> the
>>>> previous memory is disposed ( mapOutput.discard() ).
>>>>
>>>>  Not sure that makes sense.  One thing to note is that the  
>>>> particular
>>> job
>>>> that is failing does have a good number ( 200k+ ) of map  
>>>> outputs.  The
>>> large
>>>> number of small map outputs may be why we are triggering a problem.
>>>>
>>>>  Thanks again for your thoughts.
>>>>
>>>>  Andy
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: Jacob R Rideout [mailto:apache@jacobrideout.net]
>>>> Sent: Sunday, March 07, 2010 1:21 PM
>>>> To: common-user@hadoop.apache.org
>>>> Cc: Andy Sautins; Ted Yu
>>>> Subject: Re: Shuffle In Memory OutOfMemoryError
>>>>
>>>> Ted,
>>>>
>>>> Thank you. I filled MAPREDUCE-1571 to cover this issue. I might  
>>>> have
>>>> some time to write a patch later this week.
>>>>
>>>> Jacob Rideout
>>>>
>>>> On Sat, Mar 6, 2010 at 11:37 PM, Ted Yu <yu...@gmail.com>  
>>>> wrote:
>>>>> I think there is mismatch (in ReduceTask.java) between:
>>>>>     this.numCopiers = conf.getInt("mapred.reduce.parallel.copies",
>> 5);
>>>>> and:
>>>>>       maxSingleShuffleLimit = (long)(maxSize *
>>>>> MAX_SINGLE_SHUFFLE_SEGMENT_FRACTION);
>>>>> where MAX_SINGLE_SHUFFLE_SEGMENT_FRACTION is 0.25f
>>>>>
>>>>> because
>>>>>     copiers = new ArrayList<MapOutputCopier>(numCopiers);
>>>>> so the total memory allocated for in-mem shuffle is 1.25 * maxSize
>>>>>
>>>>> A JIRA should be filed to correlate the constant 5 above and
>>>>> MAX_SINGLE_SHUFFLE_SEGMENT_FRACTION.
>>>>>
>>>>> Cheers
>>>>>
>>>>> On Sat, Mar 6, 2010 at 8:31 AM, Jacob R Rideout <
>>> apache@jacobrideout.net
>>>>> wrote:
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> We are seeing the following error in our reducers of a particular
>> job:
>>>>>>
>>>>>> Error: java.lang.OutOfMemoryError: Java heap space
>>>>>>       at
>>>>>>
>>>>
>>>
>> org.apache.hadoop.mapred.ReduceTask$ReduceCopier 
>> $MapOutputCopier.shuffleInMemory(ReduceTask.java:1508)
>>>>>>       at
>>>>>>
>>>>
>>>
>> org.apache.hadoop.mapred.ReduceTask$ReduceCopier 
>> $MapOutputCopier.getMapOutput(ReduceTask.java:1408)
>>>>>>       at
>>>>>>
>>>>
>>>
>> org.apache.hadoop.mapred.ReduceTask$ReduceCopier 
>> $MapOutputCopier.copyOutput(ReduceTask.java:1261)
>>>>>>       at
>>>>>>
>>>>
>>>
>> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run 
>> (ReduceTask.java:1195)
>>>>>>
>>>>>>
>>>>>> After enough reducers fail the entire job fails. This error  
>>>>>> occurs
>>>>>> regardless of whether mapred.compress.map.output is true. We were
>> able
>>>>>> to avoid the issue by reducing
>> mapred.job.shuffle.input.buffer.percent
>>>>>> to 20%. Shouldn't the framework via  
>>>>>> ShuffleRamManager.canFitInMemory
>>>>>> and.ShuffleRamManager.reserve correctly detect the the memory
>>>>>> available for allocation? I would think that with poor  
>>>>>> configuration
>>>>>> settings (and default settings in particular) the job may not  
>>>>>> be as
>>>>>> efficient, but wouldn't die.
>>>>>>
>>>>>> Here is some more context in the logs, I have attached the full
>>>>>> reducer log here: http://gist.github.com/323746
>>>>>>
>>>>>>
>>>>>> 2010-03-06 07:54:49,621 INFO org.apache.hadoop.mapred.ReduceTask:
>>>>>> Shuffling 4191933 bytes (435311 raw bytes) into RAM from
>>>>>> attempt_201003060739_0002_m_000061_0
>>>>>> 2010-03-06 07:54:50,222 INFO org.apache.hadoop.mapred.ReduceTask:
>> Task
>>>>>> attempt_201003060739_0002_r_000000_0: Failed fetch #1 from
>>>>>> attempt_201003060739_0002_m_000202_0
>>>>>> 2010-03-06 07:54:50,223 WARN org.apache.hadoop.mapred.ReduceTask:
>>>>>> attempt_201003060739_0002_r_000000_0 adding host
>>>>>> hd37.dfs.returnpath.net to penalty box, next contact in 4 seconds
>>>>>> 2010-03-06 07:54:50,223 INFO org.apache.hadoop.mapred.ReduceTask:
>>>>>> attempt_201003060739_0002_r_000000_0: Got 1 map-outputs from
>> previous
>>>>>> failures
>>>>>> 2010-03-06 07:54:50,223 FATAL  
>>>>>> org.apache.hadoop.mapred.TaskRunner:
>>>>>> attempt_201003060739_0002_r_000000_0 : Map output copy failure :
>>>>>> java.lang.OutOfMemoryError: Java heap space
>>>>>>       at
>>>>>>
>>>>
>>>
>> org.apache.hadoop.mapred.ReduceTask$ReduceCopier 
>> $MapOutputCopier.shuffleInMemory(ReduceTask.java:1508)
>>>>>>       at
>>>>>>
>>>>
>>>
>> org.apache.hadoop.mapred.ReduceTask$ReduceCopier 
>> $MapOutputCopier.getMapOutput(ReduceTask.java:1408)
>>>>>>       at
>>>>>>
>>>>
>>>
>> org.apache.hadoop.mapred.ReduceTask$ReduceCopier 
>> $MapOutputCopier.copyOutput(ReduceTask.java:1261)
>>>>>>       at
>>>>>>
>>>>
>>>
>> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run 
>> (ReduceTask.java:1195)
>>>>>>
>>>>>>
>>>>>> We tried this both in 0.20.1 and 0.20.2. We had hoped  
>>>>>> MAPREDUCE-1182
>>>>>> would address the issue in 0.20.2, but it did not. Does anyone  
>>>>>> have
>>>>>> any comments or suggestions? Is this a bug I should file a JIRA  
>>>>>> for?
>>>>>>
>>>>>> Jacob Rideout
>>>>>> Return Path
>>>>>>
>>>>>
>>>>
>>>
>>

Re: Shuffle In Memory OutOfMemoryError

Posted by Ted Yu <yu...@gmail.com>.
Andy:
You need to manually apply the patch.

Cheers

On Tue, Mar 9, 2010 at 2:23 PM, Andy Sautins <an...@returnpath.net>wrote:

>
>   Thanks Ted.  My understanding is that MAPREDUCE-1182 is included in the
> 0.20.2 release.  We upgraded our cluster to 0.20.2 this weekend and re-ran
> the same job scenarios.  Running with mapred.reduce.parallel.copies set to 1
> and continue to have the same Java heap space error.
>
>
>
> -----Original Message-----
> From: Ted Yu [mailto:yuzhihong@gmail.com]
> Sent: Tuesday, March 09, 2010 12:56 PM
> To: common-user@hadoop.apache.org
> Subject: Re: Shuffle In Memory OutOfMemoryError
>
> This issue has been resolved in
> http://issues.apache.org/jira/browse/MAPREDUCE-1182
>
> Please apply the patch
> M1182-1v20.patch<
> http://issues.apache.org/jira/secure/attachment/12424116/M1182-1v20.patch>
>
> On Sun, Mar 7, 2010 at 3:57 PM, Andy Sautins <andy.sautins@returnpath.net
> >wrote:
>
> >
> >  Thanks Ted.  Very helpful.  You are correct that I misunderstood the
> code
> > at ReduceTask.java:1535.  I missed the fact that it's in a IOException
> catch
> > block.  My mistake.  That's what I get for being in a rush.
> >
> >  For what it's worth I did re-run the job with
> > mapred.reduce.parallel.copies set with values from 5 all the way down to
> 1.
> >  All failed with the same error:
> >
> > Error: java.lang.OutOfMemoryError: Java heap space
> >        at
> >
> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.shuffleInMemory(ReduceTask.java:1508)
> >        at
> >
> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1408)
> >        at
> >
> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1261)
> >        at
> >
> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1195)
> >
> >
> >   So from that it does seem like something else might be going on, yes?
>  I
> > need to do some more research.
> >
> >  I appreciate your insights.
> >
> >  Andy
> >
> > -----Original Message-----
> > From: Ted Yu [mailto:yuzhihong@gmail.com]
> > Sent: Sunday, March 07, 2010 3:38 PM
> > To: common-user@hadoop.apache.org
> > Subject: Re: Shuffle In Memory OutOfMemoryError
> >
> > My observation is based on this call chain:
> > MapOutputCopier.run() calling copyOutput() calling getMapOutput() calling
> > ramManager.canFitInMemory(decompressedLength)
> >
> > Basically ramManager.canFitInMemory() makes decision without considering
> > the
> > number of MapOutputCopiers that are running. Thus 1.25 * 0.7 of total
> heap
> > may be used in shuffling if default parameters were used.
> > Of course, you should check the value for mapred.reduce.parallel.copies
> to
> > see if it is 5. If it is 4 or lower, my reasoning wouldn't apply.
> >
> > About ramManager.unreserve() call, ReduceTask.java from hadoop 0.20.2
> only
> > has 2731 lines. So I have to guess the location of the code snippet you
> > provided.
> > I found this around line 1535:
> >        } catch (IOException ioe) {
> >          LOG.info("Failed to shuffle from " +
> > mapOutputLoc.getTaskAttemptId(),
> >                   ioe);
> >
> >          // Inform the ram-manager
> >          ramManager.closeInMemoryFile(mapOutputLength);
> >          ramManager.unreserve(mapOutputLength);
> >
> >          // Discard the map-output
> >          try {
> >            mapOutput.discard();
> >          } catch (IOException ignored) {
> >            LOG.info("Failed to discard map-output from " +
> >                     mapOutputLoc.getTaskAttemptId(), ignored);
> >          }
> > Please confirm the line number.
> >
> > If we're looking at the same code, I am afraid I don't see how we can
> > improve it. First, I assume IOException shouldn't happen that often.
> > Second,
> > mapOutput.discard() just sets:
> >          data = null;
> > for in memory case. Even if we call mapOutput.discard() before
> > ramManager.unreserve(), we don't know when GC would kick in and make more
> > memory available.
> > Of course, given the large number of map outputs in your system, it
> became
> > more likely that the root cause from my reasoning made OOME happen
> sooner.
> >
> > Thanks
> >
> > >
> > On Sun, Mar 7, 2010 at 1:03 PM, Andy Sautins <
> andy.sautins@returnpath.net
> > >wrote:
> >
> > >
> > >   Ted,
> > >
> > >   I'm trying to follow the logic in your mail and I'm not sure I'm
> > > following.  If you would mind helping me understand I would appreciate
> > it.
> > >
> > >   Looking at the code maxSingleShuffleLimit is only used in determining
> > if
> > > the copy _can_ fit into memory:
> > >
> > >     boolean canFitInMemory(long requestedSize) {
> > >        return (requestedSize < Integer.MAX_VALUE &&
> > >                requestedSize < maxSingleShuffleLimit);
> > >      }
> > >
> > >    It also looks like the RamManager.reserve should wait until memory
> is
> > > available so it should hit a memory limit for that reason.
> > >
> > >    What does seem a little strange to me is the following (
> > ReduceTask.java
> > > starting at 2730 ):
> > >
> > >          // Inform the ram-manager
> > >          ramManager.closeInMemoryFile(mapOutputLength);
> > >          ramManager.unreserve(mapOutputLength);
> > >
> > >          // Discard the map-output
> > >          try {
> > >            mapOutput.discard();
> > >          } catch (IOException ignored) {
> > >            LOG.info("Failed to discard map-output from " +
> > >                     mapOutputLoc.getTaskAttemptId(), ignored);
> > >          }
> > >          mapOutput = null;
> > >
> > >   So to me that looks like the ramManager unreserves the memory before
> > the
> > > mapOutput is discarded.  Shouldn't the mapOutput be discarded _before_
> > the
> > > ramManager unreserves the memory?  If the memory is unreserved before
> the
> > > actual underlying data references are removed then it seems like
> another
> > > thread can try to allocate memory ( ReduceTask.java:2730 ) before the
> > > previous memory is disposed ( mapOutput.discard() ).
> > >
> > >   Not sure that makes sense.  One thing to note is that the particular
> > job
> > > that is failing does have a good number ( 200k+ ) of map outputs.  The
> > large
> > > number of small map outputs may be why we are triggering a problem.
> > >
> > >   Thanks again for your thoughts.
> > >
> > >   Andy
> > >
> > >
> > > -----Original Message-----
> > > From: Jacob R Rideout [mailto:apache@jacobrideout.net]
> > > Sent: Sunday, March 07, 2010 1:21 PM
> > > To: common-user@hadoop.apache.org
> > > Cc: Andy Sautins; Ted Yu
> > > Subject: Re: Shuffle In Memory OutOfMemoryError
> > >
> > > Ted,
> > >
> > > Thank you. I filled MAPREDUCE-1571 to cover this issue. I might have
> > > some time to write a patch later this week.
> > >
> > > Jacob Rideout
> > >
> > > On Sat, Mar 6, 2010 at 11:37 PM, Ted Yu <yu...@gmail.com> wrote:
> > > > I think there is mismatch (in ReduceTask.java) between:
> > > >      this.numCopiers = conf.getInt("mapred.reduce.parallel.copies",
> 5);
> > > > and:
> > > >        maxSingleShuffleLimit = (long)(maxSize *
> > > > MAX_SINGLE_SHUFFLE_SEGMENT_FRACTION);
> > > > where MAX_SINGLE_SHUFFLE_SEGMENT_FRACTION is 0.25f
> > > >
> > > > because
> > > >      copiers = new ArrayList<MapOutputCopier>(numCopiers);
> > > > so the total memory allocated for in-mem shuffle is 1.25 * maxSize
> > > >
> > > > A JIRA should be filed to correlate the constant 5 above and
> > > > MAX_SINGLE_SHUFFLE_SEGMENT_FRACTION.
> > > >
> > > > Cheers
> > > >
> > > > On Sat, Mar 6, 2010 at 8:31 AM, Jacob R Rideout <
> > apache@jacobrideout.net
> > > >wrote:
> > > >
> > > >> Hi all,
> > > >>
> > > >> We are seeing the following error in our reducers of a particular
> job:
> > > >>
> > > >> Error: java.lang.OutOfMemoryError: Java heap space
> > > >>        at
> > > >>
> > >
> >
> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.shuffleInMemory(ReduceTask.java:1508)
> > > >>        at
> > > >>
> > >
> >
> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1408)
> > > >>        at
> > > >>
> > >
> >
> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1261)
> > > >>        at
> > > >>
> > >
> >
> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1195)
> > > >>
> > > >>
> > > >> After enough reducers fail the entire job fails. This error occurs
> > > >> regardless of whether mapred.compress.map.output is true. We were
> able
> > > >> to avoid the issue by reducing
> mapred.job.shuffle.input.buffer.percent
> > > >> to 20%. Shouldn't the framework via ShuffleRamManager.canFitInMemory
> > > >> and.ShuffleRamManager.reserve correctly detect the the memory
> > > >> available for allocation? I would think that with poor configuration
> > > >> settings (and default settings in particular) the job may not be as
> > > >> efficient, but wouldn't die.
> > > >>
> > > >> Here is some more context in the logs, I have attached the full
> > > >> reducer log here: http://gist.github.com/323746
> > > >>
> > > >>
> > > >> 2010-03-06 07:54:49,621 INFO org.apache.hadoop.mapred.ReduceTask:
> > > >> Shuffling 4191933 bytes (435311 raw bytes) into RAM from
> > > >> attempt_201003060739_0002_m_000061_0
> > > >> 2010-03-06 07:54:50,222 INFO org.apache.hadoop.mapred.ReduceTask:
> Task
> > > >> attempt_201003060739_0002_r_000000_0: Failed fetch #1 from
> > > >> attempt_201003060739_0002_m_000202_0
> > > >> 2010-03-06 07:54:50,223 WARN org.apache.hadoop.mapred.ReduceTask:
> > > >> attempt_201003060739_0002_r_000000_0 adding host
> > > >> hd37.dfs.returnpath.net to penalty box, next contact in 4 seconds
> > > >> 2010-03-06 07:54:50,223 INFO org.apache.hadoop.mapred.ReduceTask:
> > > >> attempt_201003060739_0002_r_000000_0: Got 1 map-outputs from
> previous
> > > >> failures
> > > >> 2010-03-06 07:54:50,223 FATAL org.apache.hadoop.mapred.TaskRunner:
> > > >> attempt_201003060739_0002_r_000000_0 : Map output copy failure :
> > > >> java.lang.OutOfMemoryError: Java heap space
> > > >>        at
> > > >>
> > >
> >
> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.shuffleInMemory(ReduceTask.java:1508)
> > > >>        at
> > > >>
> > >
> >
> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1408)
> > > >>        at
> > > >>
> > >
> >
> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1261)
> > > >>        at
> > > >>
> > >
> >
> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1195)
> > > >>
> > > >>
> > > >> We tried this both in 0.20.1 and 0.20.2. We had hoped MAPREDUCE-1182
> > > >> would address the issue in 0.20.2, but it did not. Does anyone have
> > > >> any comments or suggestions? Is this a bug I should file a JIRA for?
> > > >>
> > > >> Jacob Rideout
> > > >> Return Path
> > > >>
> > > >
> > >
> >
>

RE: Shuffle In Memory OutOfMemoryError

Posted by Andy Sautins <an...@returnpath.net>.
   Thanks Ted.  My understanding is that MAPREDUCE-1182 is included in the 0.20.2 release.  We upgraded our cluster to 0.20.2 this weekend and re-ran the same job scenarios.  Running with mapred.reduce.parallel.copies set to 1, we continue to see the same Java heap space error.
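
(For anyone reproducing these runs: the knobs being varied in this thread can
also be set per job rather than cluster-wide. A minimal sketch, assuming a
stock 0.20 JobConf -- the hypothetical class name and the particular values
are simply the ones tried in this thread, not recommendations:)

  import org.apache.hadoop.mapred.JobConf;

  public class ShuffleTuningExample {
    public static JobConf configure(JobConf conf) {
      conf.set("mapred.child.java.opts", "-Xmx640m");                   // child JVM heap (the 640m used here)
      conf.setInt("mapred.reduce.parallel.copies", 1);                  // values from 5 down to 1 were tried
      conf.setFloat("mapred.job.shuffle.input.buffer.percent", 0.20f);  // the workaround from the first post
      return conf;
    }
  }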

    

-----Original Message-----
From: Ted Yu [mailto:yuzhihong@gmail.com] 
Sent: Tuesday, March 09, 2010 12:56 PM
To: common-user@hadoop.apache.org
Subject: Re: Shuffle In Memory OutOfMemoryError

This issue has been resolved in
http://issues.apache.org/jira/browse/MAPREDUCE-1182

Please apply the patch
M1182-1v20.patch<http://issues.apache.org/jira/secure/attachment/12424116/M1182-1v20.patch>

On Sun, Mar 7, 2010 at 3:57 PM, Andy Sautins <an...@returnpath.net>wrote:

>
>  Thanks Ted.  Very helpful.  You are correct that I misunderstood the code
> at ReduceTask.java:1535.  I missed the fact that it's in a IOException catch
> block.  My mistake.  That's what I get for being in a rush.
>
>  For what it's worth I did re-run the job with
> mapred.reduce.parallel.copies set with values from 5 all the way down to 1.
>  All failed with the same error:
>
> Error: java.lang.OutOfMemoryError: Java heap space
>        at
> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.shuffleInMemory(ReduceTask.java:1508)
>        at
> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1408)
>        at
> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1261)
>        at
> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1195)
>
>
>   So from that it does seem like something else might be going on, yes?  I
> need to do some more research.
>
>  I appreciate your insights.
>
>  Andy
>
> -----Original Message-----
> From: Ted Yu [mailto:yuzhihong@gmail.com]
> Sent: Sunday, March 07, 2010 3:38 PM
> To: common-user@hadoop.apache.org
> Subject: Re: Shuffle In Memory OutOfMemoryError
>
> My observation is based on this call chain:
> MapOutputCopier.run() calling copyOutput() calling getMapOutput() calling
> ramManager.canFitInMemory(decompressedLength)
>
> Basically ramManager.canFitInMemory() makes decision without considering
> the
> number of MapOutputCopiers that are running. Thus 1.25 * 0.7 of total heap
> may be used in shuffling if default parameters were used.
> Of course, you should check the value for mapred.reduce.parallel.copies to
> see if it is 5. If it is 4 or lower, my reasoning wouldn't apply.
>
> About ramManager.unreserve() call, ReduceTask.java from hadoop 0.20.2 only
> has 2731 lines. So I have to guess the location of the code snippet you
> provided.
> I found this around line 1535:
>        } catch (IOException ioe) {
>          LOG.info("Failed to shuffle from " +
> mapOutputLoc.getTaskAttemptId(),
>                   ioe);
>
>          // Inform the ram-manager
>          ramManager.closeInMemoryFile(mapOutputLength);
>          ramManager.unreserve(mapOutputLength);
>
>          // Discard the map-output
>          try {
>            mapOutput.discard();
>          } catch (IOException ignored) {
>            LOG.info("Failed to discard map-output from " +
>                     mapOutputLoc.getTaskAttemptId(), ignored);
>          }
> Please confirm the line number.
>
> If we're looking at the same code, I am afraid I don't see how we can
> improve it. First, I assume IOException shouldn't happen that often.
> Second,
> mapOutput.discard() just sets:
>          data = null;
> for in memory case. Even if we call mapOutput.discard() before
> ramManager.unreserve(), we don't know when GC would kick in and make more
> memory available.
> Of course, given the large number of map outputs in your system, it became
> more likely that the root cause from my reasoning made OOME happen sooner.
>
> Thanks
>
> >
> On Sun, Mar 7, 2010 at 1:03 PM, Andy Sautins <andy.sautins@returnpath.net
> >wrote:
>
> >
> >   Ted,
> >
> >   I'm trying to follow the logic in your mail and I'm not sure I'm
> > following.  If you would mind helping me understand I would appreciate
> it.
> >
> >   Looking at the code maxSingleShuffleLimit is only used in determining
> if
> > the copy _can_ fit into memory:
> >
> >     boolean canFitInMemory(long requestedSize) {
> >        return (requestedSize < Integer.MAX_VALUE &&
> >                requestedSize < maxSingleShuffleLimit);
> >      }
> >
> >    It also looks like the RamManager.reserve should wait until memory is
> > available so it should hit a memory limit for that reason.
> >
> >    What does seem a little strange to me is the following (
> ReduceTask.java
> > starting at 2730 ):
> >
> >          // Inform the ram-manager
> >          ramManager.closeInMemoryFile(mapOutputLength);
> >          ramManager.unreserve(mapOutputLength);
> >
> >          // Discard the map-output
> >          try {
> >            mapOutput.discard();
> >          } catch (IOException ignored) {
> >            LOG.info("Failed to discard map-output from " +
> >                     mapOutputLoc.getTaskAttemptId(), ignored);
> >          }
> >          mapOutput = null;
> >
> >   So to me that looks like the ramManager unreserves the memory before
> the
> > mapOutput is discarded.  Shouldn't the mapOutput be discarded _before_
> the
> > ramManager unreserves the memory?  If the memory is unreserved before the
> > actual underlying data references are removed then it seems like another
> > thread can try to allocate memory ( ReduceTask.java:2730 ) before the
> > previous memory is disposed ( mapOutput.discard() ).
> >
> >   Not sure that makes sense.  One thing to note is that the particular
> job
> > that is failing does have a good number ( 200k+ ) of map outputs.  The
> large
> > number of small map outputs may be why we are triggering a problem.
> >
> >   Thanks again for your thoughts.
> >
> >   Andy
> >
> >
> > -----Original Message-----
> > From: Jacob R Rideout [mailto:apache@jacobrideout.net]
> > Sent: Sunday, March 07, 2010 1:21 PM
> > To: common-user@hadoop.apache.org
> > Cc: Andy Sautins; Ted Yu
> > Subject: Re: Shuffle In Memory OutOfMemoryError
> >
> > Ted,
> >
> > Thank you. I filled MAPREDUCE-1571 to cover this issue. I might have
> > some time to write a patch later this week.
> >
> > Jacob Rideout
> >
> > On Sat, Mar 6, 2010 at 11:37 PM, Ted Yu <yu...@gmail.com> wrote:
> > > I think there is mismatch (in ReduceTask.java) between:
> > >      this.numCopiers = conf.getInt("mapred.reduce.parallel.copies", 5);
> > > and:
> > >        maxSingleShuffleLimit = (long)(maxSize *
> > > MAX_SINGLE_SHUFFLE_SEGMENT_FRACTION);
> > > where MAX_SINGLE_SHUFFLE_SEGMENT_FRACTION is 0.25f
> > >
> > > because
> > >      copiers = new ArrayList<MapOutputCopier>(numCopiers);
> > > so the total memory allocated for in-mem shuffle is 1.25 * maxSize
> > >
> > > A JIRA should be filed to correlate the constant 5 above and
> > > MAX_SINGLE_SHUFFLE_SEGMENT_FRACTION.
> > >
> > > Cheers
> > >
> > > On Sat, Mar 6, 2010 at 8:31 AM, Jacob R Rideout <
> apache@jacobrideout.net
> > >wrote:
> > >
> > >> Hi all,
> > >>
> > >> We are seeing the following error in our reducers of a particular job:
> > >>
> > >> Error: java.lang.OutOfMemoryError: Java heap space
> > >>        at
> > >>
> >
> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.shuffleInMemory(ReduceTask.java:1508)
> > >>        at
> > >>
> >
> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1408)
> > >>        at
> > >>
> >
> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1261)
> > >>        at
> > >>
> >
> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1195)
> > >>
> > >>
> > >> After enough reducers fail the entire job fails. This error occurs
> > >> regardless of whether mapred.compress.map.output is true. We were able
> > >> to avoid the issue by reducing mapred.job.shuffle.input.buffer.percent
> > >> to 20%. Shouldn't the framework via ShuffleRamManager.canFitInMemory
> > >> and.ShuffleRamManager.reserve correctly detect the the memory
> > >> available for allocation? I would think that with poor configuration
> > >> settings (and default settings in particular) the job may not be as
> > >> efficient, but wouldn't die.
> > >>
> > >> Here is some more context in the logs, I have attached the full
> > >> reducer log here: http://gist.github.com/323746
> > >>
> > >>
> > >> 2010-03-06 07:54:49,621 INFO org.apache.hadoop.mapred.ReduceTask:
> > >> Shuffling 4191933 bytes (435311 raw bytes) into RAM from
> > >> attempt_201003060739_0002_m_000061_0
> > >> 2010-03-06 07:54:50,222 INFO org.apache.hadoop.mapred.ReduceTask: Task
> > >> attempt_201003060739_0002_r_000000_0: Failed fetch #1 from
> > >> attempt_201003060739_0002_m_000202_0
> > >> 2010-03-06 07:54:50,223 WARN org.apache.hadoop.mapred.ReduceTask:
> > >> attempt_201003060739_0002_r_000000_0 adding host
> > >> hd37.dfs.returnpath.net to penalty box, next contact in 4 seconds
> > >> 2010-03-06 07:54:50,223 INFO org.apache.hadoop.mapred.ReduceTask:
> > >> attempt_201003060739_0002_r_000000_0: Got 1 map-outputs from previous
> > >> failures
> > >> 2010-03-06 07:54:50,223 FATAL org.apache.hadoop.mapred.TaskRunner:
> > >> attempt_201003060739_0002_r_000000_0 : Map output copy failure :
> > >> java.lang.OutOfMemoryError: Java heap space
> > >>        at
> > >>
> >
> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.shuffleInMemory(ReduceTask.java:1508)
> > >>        at
> > >>
> >
> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1408)
> > >>        at
> > >>
> >
> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1261)
> > >>        at
> > >>
> >
> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1195)
> > >>
> > >>
> > >> We tried this both in 0.20.1 and 0.20.2. We had hoped MAPREDUCE-1182
> > >> would address the issue in 0.20.2, but it did not. Does anyone have
> > >> any comments or suggestions? Is this a bug I should file a JIRA for?
> > >>
> > >> Jacob Rideout
> > >> Return Path
> > >>
> > >
> >
>

Re: Shuffle In Memory OutOfMemoryError

Posted by Ted Yu <yu...@gmail.com>.
This issue has been resolved in
http://issues.apache.org/jira/browse/MAPREDUCE-1182

Please apply the patch
M1182-1v20.patch<http://issues.apache.org/jira/secure/attachment/12424116/M1182-1v20.patch>

On Sun, Mar 7, 2010 at 3:57 PM, Andy Sautins <an...@returnpath.net>wrote:

>
>  Thanks Ted.  Very helpful.  You are correct that I misunderstood the code
> at ReduceTask.java:1535.  I missed the fact that it's in a IOException catch
> block.  My mistake.  That's what I get for being in a rush.
>
>  For what it's worth I did re-run the job with
> mapred.reduce.parallel.copies set with values from 5 all the way down to 1.
>  All failed with the same error:
>
> Error: java.lang.OutOfMemoryError: Java heap space
>        at
> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.shuffleInMemory(ReduceTask.java:1508)
>        at
> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1408)
>        at
> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1261)
>        at
> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1195)
>
>
>   So from that it does seem like something else might be going on, yes?  I
> need to do some more research.
>
>  I appreciate your insights.
>
>  Andy
>
> -----Original Message-----
> From: Ted Yu [mailto:yuzhihong@gmail.com]
> Sent: Sunday, March 07, 2010 3:38 PM
> To: common-user@hadoop.apache.org
> Subject: Re: Shuffle In Memory OutOfMemoryError
>
> My observation is based on this call chain:
> MapOutputCopier.run() calling copyOutput() calling getMapOutput() calling
> ramManager.canFitInMemory(decompressedLength)
>
> Basically ramManager.canFitInMemory() makes decision without considering
> the
> number of MapOutputCopiers that are running. Thus 1.25 * 0.7 of total heap
> may be used in shuffling if default parameters were used.
> Of course, you should check the value for mapred.reduce.parallel.copies to
> see if it is 5. If it is 4 or lower, my reasoning wouldn't apply.
>
> About ramManager.unreserve() call, ReduceTask.java from hadoop 0.20.2 only
> has 2731 lines. So I have to guess the location of the code snippet you
> provided.
> I found this around line 1535:
>        } catch (IOException ioe) {
>          LOG.info("Failed to shuffle from " +
> mapOutputLoc.getTaskAttemptId(),
>                   ioe);
>
>          // Inform the ram-manager
>          ramManager.closeInMemoryFile(mapOutputLength);
>          ramManager.unreserve(mapOutputLength);
>
>          // Discard the map-output
>          try {
>            mapOutput.discard();
>          } catch (IOException ignored) {
>            LOG.info("Failed to discard map-output from " +
>                     mapOutputLoc.getTaskAttemptId(), ignored);
>          }
> Please confirm the line number.
>
> If we're looking at the same code, I am afraid I don't see how we can
> improve it. First, I assume IOException shouldn't happen that often.
> Second,
> mapOutput.discard() just sets:
>          data = null;
> for in memory case. Even if we call mapOutput.discard() before
> ramManager.unreserve(), we don't know when GC would kick in and make more
> memory available.
> Of course, given the large number of map outputs in your system, it became
> more likely that the root cause from my reasoning made OOME happen sooner.
>
> Thanks
>
> >
> On Sun, Mar 7, 2010 at 1:03 PM, Andy Sautins <andy.sautins@returnpath.net
> >wrote:
>
> >
> >   Ted,
> >
> >   I'm trying to follow the logic in your mail and I'm not sure I'm
> > following.  If you would mind helping me understand I would appreciate
> it.
> >
> >   Looking at the code maxSingleShuffleLimit is only used in determining
> if
> > the copy _can_ fit into memory:
> >
> >     boolean canFitInMemory(long requestedSize) {
> >        return (requestedSize < Integer.MAX_VALUE &&
> >                requestedSize < maxSingleShuffleLimit);
> >      }
> >
> >    It also looks like the RamManager.reserve should wait until memory is
> > available so it should hit a memory limit for that reason.
> >
> >    What does seem a little strange to me is the following (
> ReduceTask.java
> > starting at 2730 ):
> >
> >          // Inform the ram-manager
> >          ramManager.closeInMemoryFile(mapOutputLength);
> >          ramManager.unreserve(mapOutputLength);
> >
> >          // Discard the map-output
> >          try {
> >            mapOutput.discard();
> >          } catch (IOException ignored) {
> >            LOG.info("Failed to discard map-output from " +
> >                     mapOutputLoc.getTaskAttemptId(), ignored);
> >          }
> >          mapOutput = null;
> >
> >   So to me that looks like the ramManager unreserves the memory before
> the
> > mapOutput is discarded.  Shouldn't the mapOutput be discarded _before_
> the
> > ramManager unreserves the memory?  If the memory is unreserved before the
> > actual underlying data references are removed then it seems like another
> > thread can try to allocate memory ( ReduceTask.java:2730 ) before the
> > previous memory is disposed ( mapOutput.discard() ).
> >
> >   Not sure that makes sense.  One thing to note is that the particular
> job
> > that is failing does have a good number ( 200k+ ) of map outputs.  The
> large
> > number of small map outputs may be why we are triggering a problem.
> >
> >   Thanks again for your thoughts.
> >
> >   Andy
> >
> >
> > -----Original Message-----
> > From: Jacob R Rideout [mailto:apache@jacobrideout.net]
> > Sent: Sunday, March 07, 2010 1:21 PM
> > To: common-user@hadoop.apache.org
> > Cc: Andy Sautins; Ted Yu
> > Subject: Re: Shuffle In Memory OutOfMemoryError
> >
> > Ted,
> >
> > Thank you. I filled MAPREDUCE-1571 to cover this issue. I might have
> > some time to write a patch later this week.
> >
> > Jacob Rideout
> >
> > On Sat, Mar 6, 2010 at 11:37 PM, Ted Yu <yu...@gmail.com> wrote:
> > > I think there is mismatch (in ReduceTask.java) between:
> > >      this.numCopiers = conf.getInt("mapred.reduce.parallel.copies", 5);
> > > and:
> > >        maxSingleShuffleLimit = (long)(maxSize *
> > > MAX_SINGLE_SHUFFLE_SEGMENT_FRACTION);
> > > where MAX_SINGLE_SHUFFLE_SEGMENT_FRACTION is 0.25f
> > >
> > > because
> > >      copiers = new ArrayList<MapOutputCopier>(numCopiers);
> > > so the total memory allocated for in-mem shuffle is 1.25 * maxSize
> > >
> > > A JIRA should be filed to correlate the constant 5 above and
> > > MAX_SINGLE_SHUFFLE_SEGMENT_FRACTION.
> > >
> > > Cheers
> > >
> > > On Sat, Mar 6, 2010 at 8:31 AM, Jacob R Rideout <
> apache@jacobrideout.net
> > >wrote:
> > >
> > >> Hi all,
> > >>
> > >> We are seeing the following error in our reducers of a particular job:
> > >>
> > >> Error: java.lang.OutOfMemoryError: Java heap space
> > >>        at
> > >>
> >
> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.shuffleInMemory(ReduceTask.java:1508)
> > >>        at
> > >>
> >
> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1408)
> > >>        at
> > >>
> >
> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1261)
> > >>        at
> > >>
> >
> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1195)
> > >>
> > >>
> > >> After enough reducers fail the entire job fails. This error occurs
> > >> regardless of whether mapred.compress.map.output is true. We were able
> > >> to avoid the issue by reducing mapred.job.shuffle.input.buffer.percent
> > >> to 20%. Shouldn't the framework via ShuffleRamManager.canFitInMemory
> > >> and.ShuffleRamManager.reserve correctly detect the the memory
> > >> available for allocation? I would think that with poor configuration
> > >> settings (and default settings in particular) the job may not be as
> > >> efficient, but wouldn't die.
> > >>
> > >> Here is some more context in the logs, I have attached the full
> > >> reducer log here: http://gist.github.com/323746
> > >>
> > >>
> > >> 2010-03-06 07:54:49,621 INFO org.apache.hadoop.mapred.ReduceTask:
> > >> Shuffling 4191933 bytes (435311 raw bytes) into RAM from
> > >> attempt_201003060739_0002_m_000061_0
> > >> 2010-03-06 07:54:50,222 INFO org.apache.hadoop.mapred.ReduceTask: Task
> > >> attempt_201003060739_0002_r_000000_0: Failed fetch #1 from
> > >> attempt_201003060739_0002_m_000202_0
> > >> 2010-03-06 07:54:50,223 WARN org.apache.hadoop.mapred.ReduceTask:
> > >> attempt_201003060739_0002_r_000000_0 adding host
> > >> hd37.dfs.returnpath.net to penalty box, next contact in 4 seconds
> > >> 2010-03-06 07:54:50,223 INFO org.apache.hadoop.mapred.ReduceTask:
> > >> attempt_201003060739_0002_r_000000_0: Got 1 map-outputs from previous
> > >> failures
> > >> 2010-03-06 07:54:50,223 FATAL org.apache.hadoop.mapred.TaskRunner:
> > >> attempt_201003060739_0002_r_000000_0 : Map output copy failure :
> > >> java.lang.OutOfMemoryError: Java heap space
> > >>        at
> > >>
> >
> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.shuffleInMemory(ReduceTask.java:1508)
> > >>        at
> > >>
> >
> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1408)
> > >>        at
> > >>
> >
> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1261)
> > >>        at
> > >>
> >
> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1195)
> > >>
> > >>
> > >> We tried this both in 0.20.1 and 0.20.2. We had hoped MAPREDUCE-1182
> > >> would address the issue in 0.20.2, but it did not. Does anyone have
> > >> any comments or suggestions? Is this a bug I should file a JIRA for?
> > >>
> > >> Jacob Rideout
> > >> Return Path
> > >>
> > >
> >
>

Re: Shuffle In Memory OutOfMemoryError

Posted by Ted Yu <yu...@gmail.com>.
Lowering mapred.job.shuffle.input.buffer.percent would be the option to
choose.

Maybe GC wasn't releasing memory fast enough for in-memory shuffling.
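
For anyone who wants to try that per job rather than cluster-wide, a minimal
sketch (driver details elided; the property name and its 0.70 default are the
0.20.x values, and 0.20f mirrors the workaround Jacob described):

    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    public class LowShuffleBufferJob {
      public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(LowShuffleBufferJob.class);
        // Default is 0.70 of the reduce task heap; 0.20 avoided the OOME here.
        conf.setFloat("mapred.job.shuffle.input.buffer.percent", 0.20f);
        // ... mapper/reducer/input/output setup elided ...
        JobClient.runJob(conf);
      }
    }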

On Sun, Mar 7, 2010 at 3:57 PM, Andy Sautins <an...@returnpath.net>wrote:

>
>  Thanks Ted.  Very helpful.  You are correct that I misunderstood the code
> at ReduceTask.java:1535.  I missed the fact that it's in a IOException catch
> block.  My mistake.  That's what I get for being in a rush.
>
>  For what it's worth I did re-run the job with
> mapred.reduce.parallel.copies set with values from 5 all the way down to 1.
>  All failed with the same error:
>
> Error: java.lang.OutOfMemoryError: Java heap space
>        at
> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.shuffleInMemory(ReduceTask.java:1508)
>        at
> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1408)
>        at
> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1261)
>        at
> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1195)
>
>
>   So from that it does seem like something else might be going on, yes?  I
> need to do some more research.
>
>  I appreciate your insights.
>
>  Andy
>
> -----Original Message-----
> From: Ted Yu [mailto:yuzhihong@gmail.com]
> Sent: Sunday, March 07, 2010 3:38 PM
> To: common-user@hadoop.apache.org
> Subject: Re: Shuffle In Memory OutOfMemoryError
>
> My observation is based on this call chain:
> MapOutputCopier.run() calling copyOutput() calling getMapOutput() calling
> ramManager.canFitInMemory(decompressedLength)
>
> Basically ramManager.canFitInMemory() makes decision without considering
> the
> number of MapOutputCopiers that are running. Thus 1.25 * 0.7 of total heap
> may be used in shuffling if default parameters were used.
> Of course, you should check the value for mapred.reduce.parallel.copies to
> see if it is 5. If it is 4 or lower, my reasoning wouldn't apply.
>
> About ramManager.unreserve() call, ReduceTask.java from hadoop 0.20.2 only
> has 2731 lines. So I have to guess the location of the code snippet you
> provided.
> I found this around line 1535:
>        } catch (IOException ioe) {
>          LOG.info("Failed to shuffle from " +
> mapOutputLoc.getTaskAttemptId(),
>                   ioe);
>
>          // Inform the ram-manager
>          ramManager.closeInMemoryFile(mapOutputLength);
>          ramManager.unreserve(mapOutputLength);
>
>          // Discard the map-output
>          try {
>            mapOutput.discard();
>          } catch (IOException ignored) {
>            LOG.info("Failed to discard map-output from " +
>                     mapOutputLoc.getTaskAttemptId(), ignored);
>          }
> Please confirm the line number.
>
> If we're looking at the same code, I am afraid I don't see how we can
> improve it. First, I assume IOException shouldn't happen that often.
> Second,
> mapOutput.discard() just sets:
>          data = null;
> for in memory case. Even if we call mapOutput.discard() before
> ramManager.unreserve(), we don't know when GC would kick in and make more
> memory available.
> Of course, given the large number of map outputs in your system, it became
> more likely that the root cause from my reasoning made OOME happen sooner.
>
> Thanks
>
> >
> On Sun, Mar 7, 2010 at 1:03 PM, Andy Sautins <andy.sautins@returnpath.net
> >wrote:
>
> >
> >   Ted,
> >
> >   I'm trying to follow the logic in your mail and I'm not sure I'm
> > following.  If you would mind helping me understand I would appreciate
> it.
> >
> >   Looking at the code maxSingleShuffleLimit is only used in determining
> if
> > the copy _can_ fit into memory:
> >
> >     boolean canFitInMemory(long requestedSize) {
> >        return (requestedSize < Integer.MAX_VALUE &&
> >                requestedSize < maxSingleShuffleLimit);
> >      }
> >
> >    It also looks like the RamManager.reserve should wait until memory is
> > available so it should hit a memory limit for that reason.
> >
> >    What does seem a little strange to me is the following (
> ReduceTask.java
> > starting at 2730 ):
> >
> >          // Inform the ram-manager
> >          ramManager.closeInMemoryFile(mapOutputLength);
> >          ramManager.unreserve(mapOutputLength);
> >
> >          // Discard the map-output
> >          try {
> >            mapOutput.discard();
> >          } catch (IOException ignored) {
> >            LOG.info("Failed to discard map-output from " +
> >                     mapOutputLoc.getTaskAttemptId(), ignored);
> >          }
> >          mapOutput = null;
> >
> >   So to me that looks like the ramManager unreserves the memory before
> the
> > mapOutput is discarded.  Shouldn't the mapOutput be discarded _before_
> the
> > ramManager unreserves the memory?  If the memory is unreserved before the
> > actual underlying data references are removed then it seems like another
> > thread can try to allocate memory ( ReduceTask.java:2730 ) before the
> > previous memory is disposed ( mapOutput.discard() ).
> >
> >   Not sure that makes sense.  One thing to note is that the particular
> job
> > that is failing does have a good number ( 200k+ ) of map outputs.  The
> large
> > number of small map outputs may be why we are triggering a problem.
> >
> >   Thanks again for your thoughts.
> >
> >   Andy
> >
> >
> > -----Original Message-----
> > From: Jacob R Rideout [mailto:apache@jacobrideout.net]
> > Sent: Sunday, March 07, 2010 1:21 PM
> > To: common-user@hadoop.apache.org
> > Cc: Andy Sautins; Ted Yu
> > Subject: Re: Shuffle In Memory OutOfMemoryError
> >
> > Ted,
> >
> > Thank you. I filled MAPREDUCE-1571 to cover this issue. I might have
> > some time to write a patch later this week.
> >
> > Jacob Rideout
> >
> > On Sat, Mar 6, 2010 at 11:37 PM, Ted Yu <yu...@gmail.com> wrote:
> > > I think there is mismatch (in ReduceTask.java) between:
> > >      this.numCopiers = conf.getInt("mapred.reduce.parallel.copies", 5);
> > > and:
> > >        maxSingleShuffleLimit = (long)(maxSize *
> > > MAX_SINGLE_SHUFFLE_SEGMENT_FRACTION);
> > > where MAX_SINGLE_SHUFFLE_SEGMENT_FRACTION is 0.25f
> > >
> > > because
> > >      copiers = new ArrayList<MapOutputCopier>(numCopiers);
> > > so the total memory allocated for in-mem shuffle is 1.25 * maxSize
> > >
> > > A JIRA should be filed to correlate the constant 5 above and
> > > MAX_SINGLE_SHUFFLE_SEGMENT_FRACTION.
> > >
> > > Cheers
> > >
> > > On Sat, Mar 6, 2010 at 8:31 AM, Jacob R Rideout <
> apache@jacobrideout.net
> > >wrote:
> > >
> > >> Hi all,
> > >>
> > >> We are seeing the following error in our reducers of a particular job:
> > >>
> > >> Error: java.lang.OutOfMemoryError: Java heap space
> > >>        at
> > >>
> >
> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.shuffleInMemory(ReduceTask.java:1508)
> > >>        at
> > >>
> >
> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1408)
> > >>        at
> > >>
> >
> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1261)
> > >>        at
> > >>
> >
> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1195)
> > >>
> > >>
> > >> After enough reducers fail the entire job fails. This error occurs
> > >> regardless of whether mapred.compress.map.output is true. We were able
> > >> to avoid the issue by reducing mapred.job.shuffle.input.buffer.percent
> > >> to 20%. Shouldn't the framework via ShuffleRamManager.canFitInMemory
> > >> and.ShuffleRamManager.reserve correctly detect the the memory
> > >> available for allocation? I would think that with poor configuration
> > >> settings (and default settings in particular) the job may not be as
> > >> efficient, but wouldn't die.
> > >>
> > >> Here is some more context in the logs, I have attached the full
> > >> reducer log here: http://gist.github.com/323746
> > >>
> > >>
> > >> 2010-03-06 07:54:49,621 INFO org.apache.hadoop.mapred.ReduceTask:
> > >> Shuffling 4191933 bytes (435311 raw bytes) into RAM from
> > >> attempt_201003060739_0002_m_000061_0
> > >> 2010-03-06 07:54:50,222 INFO org.apache.hadoop.mapred.ReduceTask: Task
> > >> attempt_201003060739_0002_r_000000_0: Failed fetch #1 from
> > >> attempt_201003060739_0002_m_000202_0
> > >> 2010-03-06 07:54:50,223 WARN org.apache.hadoop.mapred.ReduceTask:
> > >> attempt_201003060739_0002_r_000000_0 adding host
> > >> hd37.dfs.returnpath.net to penalty box, next contact in 4 seconds
> > >> 2010-03-06 07:54:50,223 INFO org.apache.hadoop.mapred.ReduceTask:
> > >> attempt_201003060739_0002_r_000000_0: Got 1 map-outputs from previous
> > >> failures
> > >> 2010-03-06 07:54:50,223 FATAL org.apache.hadoop.mapred.TaskRunner:
> > >> attempt_201003060739_0002_r_000000_0 : Map output copy failure :
> > >> java.lang.OutOfMemoryError: Java heap space
> > >>        at
> > >>
> >
> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.shuffleInMemory(ReduceTask.java:1508)
> > >>        at
> > >>
> >
> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1408)
> > >>        at
> > >>
> >
> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1261)
> > >>        at
> > >>
> >
> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1195)
> > >>
> > >>
> > >> We tried this both in 0.20.1 and 0.20.2. We had hoped MAPREDUCE-1182
> > >> would address the issue in 0.20.2, but it did not. Does anyone have
> > >> any comments or suggestions? Is this a bug I should file a JIRA for?
> > >>
> > >> Jacob Rideout
> > >> Return Path
> > >>
> > >
> >
>

RE: Shuffle In Memory OutOfMemoryError

Posted by Andy Sautins <an...@returnpath.net>.
  Thanks Ted.  Very helpful.  You are correct that I misunderstood the code at ReduceTask.java:1535.  I missed the fact that it's in an IOException catch block.  My mistake.  That's what I get for being in a rush.

  For what it's worth, I did re-run the job with mapred.reduce.parallel.copies set to values from 5 all the way down to 1.  All failed with the same error:

Error: java.lang.OutOfMemoryError: Java heap space
	at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.shuffleInMemory(ReduceTask.java:1508)
	at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1408)
	at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1261)
	at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1195)

  
  So from that it does seem like something else might be going on, yes?  I need to do some more research.  

  I appreciate your insights.

  Andy   

-----Original Message-----
From: Ted Yu [mailto:yuzhihong@gmail.com] 
Sent: Sunday, March 07, 2010 3:38 PM
To: common-user@hadoop.apache.org
Subject: Re: Shuffle In Memory OutOfMemoryError

My observation is based on this call chain:
MapOutputCopier.run() calling copyOutput() calling getMapOutput() calling
ramManager.canFitInMemory(decompressedLength)

Basically ramManager.canFitInMemory() makes decision without considering the
number of MapOutputCopiers that are running. Thus 1.25 * 0.7 of total heap
may be used in shuffling if default parameters were used.
Of course, you should check the value for mapred.reduce.parallel.copies to
see if it is 5. If it is 4 or lower, my reasoning wouldn't apply.

About ramManager.unreserve() call, ReduceTask.java from hadoop 0.20.2 only
has 2731 lines. So I have to guess the location of the code snippet you
provided.
I found this around line 1535:
        } catch (IOException ioe) {
          LOG.info("Failed to shuffle from " +
mapOutputLoc.getTaskAttemptId(),
                   ioe);

          // Inform the ram-manager
          ramManager.closeInMemoryFile(mapOutputLength);
          ramManager.unreserve(mapOutputLength);

          // Discard the map-output
          try {
            mapOutput.discard();
          } catch (IOException ignored) {
            LOG.info("Failed to discard map-output from " +
                     mapOutputLoc.getTaskAttemptId(), ignored);
          }
Please confirm the line number.

If we're looking at the same code, I am afraid I don't see how we can
improve it. First, I assume IOException shouldn't happen that often. Second,
mapOutput.discard() just sets:
          data = null;
for in memory case. Even if we call mapOutput.discard() before
ramManager.unreserve(), we don't know when GC would kick in and make more
memory available.
Of course, given the large number of map outputs in your system, it became
more likely that the root cause from my reasoning made OOME happen sooner.

Thanks

>
On Sun, Mar 7, 2010 at 1:03 PM, Andy Sautins <an...@returnpath.net>wrote:

>
>   Ted,
>
>   I'm trying to follow the logic in your mail and I'm not sure I'm
> following.  If you would mind helping me understand I would appreciate it.
>
>   Looking at the code maxSingleShuffleLimit is only used in determining if
> the copy _can_ fit into memory:
>
>     boolean canFitInMemory(long requestedSize) {
>        return (requestedSize < Integer.MAX_VALUE &&
>                requestedSize < maxSingleShuffleLimit);
>      }
>
>    It also looks like the RamManager.reserve should wait until memory is
> available so it should hit a memory limit for that reason.
>
>    What does seem a little strange to me is the following ( ReduceTask.java
> starting at 2730 ):
>
>          // Inform the ram-manager
>          ramManager.closeInMemoryFile(mapOutputLength);
>          ramManager.unreserve(mapOutputLength);
>
>          // Discard the map-output
>          try {
>            mapOutput.discard();
>          } catch (IOException ignored) {
>            LOG.info("Failed to discard map-output from " +
>                     mapOutputLoc.getTaskAttemptId(), ignored);
>          }
>          mapOutput = null;
>
>   So to me that looks like the ramManager unreserves the memory before the
> mapOutput is discarded.  Shouldn't the mapOutput be discarded _before_ the
> ramManager unreserves the memory?  If the memory is unreserved before the
> actual underlying data references are removed then it seems like another
> thread can try to allocate memory ( ReduceTask.java:2730 ) before the
> previous memory is disposed ( mapOutput.discard() ).
>
>   Not sure that makes sense.  One thing to note is that the particular job
> that is failing does have a good number ( 200k+ ) of map outputs.  The large
> number of small map outputs may be why we are triggering a problem.
>
>   Thanks again for your thoughts.
>
>   Andy
>
>
> -----Original Message-----
> From: Jacob R Rideout [mailto:apache@jacobrideout.net]
> Sent: Sunday, March 07, 2010 1:21 PM
> To: common-user@hadoop.apache.org
> Cc: Andy Sautins; Ted Yu
> Subject: Re: Shuffle In Memory OutOfMemoryError
>
> Ted,
>
> Thank you. I filled MAPREDUCE-1571 to cover this issue. I might have
> some time to write a patch later this week.
>
> Jacob Rideout
>
> On Sat, Mar 6, 2010 at 11:37 PM, Ted Yu <yu...@gmail.com> wrote:
> > I think there is mismatch (in ReduceTask.java) between:
> >      this.numCopiers = conf.getInt("mapred.reduce.parallel.copies", 5);
> > and:
> >        maxSingleShuffleLimit = (long)(maxSize *
> > MAX_SINGLE_SHUFFLE_SEGMENT_FRACTION);
> > where MAX_SINGLE_SHUFFLE_SEGMENT_FRACTION is 0.25f
> >
> > because
> >      copiers = new ArrayList<MapOutputCopier>(numCopiers);
> > so the total memory allocated for in-mem shuffle is 1.25 * maxSize
> >
> > A JIRA should be filed to correlate the constant 5 above and
> > MAX_SINGLE_SHUFFLE_SEGMENT_FRACTION.
> >
> > Cheers
> >
> > On Sat, Mar 6, 2010 at 8:31 AM, Jacob R Rideout <apache@jacobrideout.net
> >wrote:
> >
> >> Hi all,
> >>
> >> We are seeing the following error in our reducers of a particular job:
> >>
> >> Error: java.lang.OutOfMemoryError: Java heap space
> >>        at
> >>
> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.shuffleInMemory(ReduceTask.java:1508)
> >>        at
> >>
> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1408)
> >>        at
> >>
> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1261)
> >>        at
> >>
> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1195)
> >>
> >>
> >> After enough reducers fail the entire job fails. This error occurs
> >> regardless of whether mapred.compress.map.output is true. We were able
> >> to avoid the issue by reducing mapred.job.shuffle.input.buffer.percent
> >> to 20%. Shouldn't the framework via ShuffleRamManager.canFitInMemory
> >> and.ShuffleRamManager.reserve correctly detect the the memory
> >> available for allocation? I would think that with poor configuration
> >> settings (and default settings in particular) the job may not be as
> >> efficient, but wouldn't die.
> >>
> >> Here is some more context in the logs, I have attached the full
> >> reducer log here: http://gist.github.com/323746
> >>
> >>
> >> 2010-03-06 07:54:49,621 INFO org.apache.hadoop.mapred.ReduceTask:
> >> Shuffling 4191933 bytes (435311 raw bytes) into RAM from
> >> attempt_201003060739_0002_m_000061_0
> >> 2010-03-06 07:54:50,222 INFO org.apache.hadoop.mapred.ReduceTask: Task
> >> attempt_201003060739_0002_r_000000_0: Failed fetch #1 from
> >> attempt_201003060739_0002_m_000202_0
> >> 2010-03-06 07:54:50,223 WARN org.apache.hadoop.mapred.ReduceTask:
> >> attempt_201003060739_0002_r_000000_0 adding host
> >> hd37.dfs.returnpath.net to penalty box, next contact in 4 seconds
> >> 2010-03-06 07:54:50,223 INFO org.apache.hadoop.mapred.ReduceTask:
> >> attempt_201003060739_0002_r_000000_0: Got 1 map-outputs from previous
> >> failures
> >> 2010-03-06 07:54:50,223 FATAL org.apache.hadoop.mapred.TaskRunner:
> >> attempt_201003060739_0002_r_000000_0 : Map output copy failure :
> >> java.lang.OutOfMemoryError: Java heap space
> >>        at
> >>
> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.shuffleInMemory(ReduceTask.java:1508)
> >>        at
> >>
> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1408)
> >>        at
> >>
> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1261)
> >>        at
> >>
> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1195)
> >>
> >>
> >> We tried this both in 0.20.1 and 0.20.2. We had hoped MAPREDUCE-1182
> >> would address the issue in 0.20.2, but it did not. Does anyone have
> >> any comments or suggestions? Is this a bug I should file a JIRA for?
> >>
> >> Jacob Rideout
> >> Return Path
> >>
> >
>

Re: Shuffle In Memory OutOfMemoryError

Posted by Ted Yu <yu...@gmail.com>.
My observation is based on this call chain:
MapOutputCopier.run() calling copyOutput() calling getMapOutput() calling
ramManager.canFitInMemory(decompressedLength)

Basically ramManager.canFitInMemory() makes its decision without considering
the number of MapOutputCopiers that are running. Thus up to 1.25 * 0.7 of the
total heap may be used for shuffling if the default parameters are in effect.
Of course, you should check the value for mapred.reduce.parallel.copies to
see if it is 5. If it is 4 or lower, my reasoning wouldn't apply.
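
To make the arithmetic concrete, here is a back-of-envelope sketch assuming a
1 GB reduce-task heap and the 0.20.x defaults (the heap size is purely
illustrative):

    public class ShuffleBudgetMath {
      public static void main(String[] args) {
        long heap = 1L << 30;                // assume -Xmx1024m (illustrative)
        float bufferPercent = 0.70f;         // mapred.job.shuffle.input.buffer.percent default
        float segmentFraction = 0.25f;       // MAX_SINGLE_SHUFFLE_SEGMENT_FRACTION
        int copiers = 5;                     // mapred.reduce.parallel.copies default
        long maxSize = (long) (heap * bufferPercent);       // ~717 MB shuffle budget
        long perCopy = (long) (maxSize * segmentFraction);  // ~179 MB allowed per copy
        long worstCase = copiers * perCopy;                 // ~896 MB of the 1024 MB heap
        System.out.println(maxSize + " " + perCopy + " " + worstCase);
      }
    }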

About the ramManager.unreserve() call: ReduceTask.java from Hadoop 0.20.2
only has 2731 lines, so I have to guess the location of the code snippet you
provided.
I found this around line 1535:
        } catch (IOException ioe) {
          LOG.info("Failed to shuffle from " +
mapOutputLoc.getTaskAttemptId(),
                   ioe);

          // Inform the ram-manager
          ramManager.closeInMemoryFile(mapOutputLength);
          ramManager.unreserve(mapOutputLength);

          // Discard the map-output
          try {
            mapOutput.discard();
          } catch (IOException ignored) {
            LOG.info("Failed to discard map-output from " +
                     mapOutputLoc.getTaskAttemptId(), ignored);
          }
Please confirm the line number.

If we're looking at the same code, I am afraid I don't see how we can
improve it. First, I assume IOException shouldn't happen that often. Second,
mapOutput.discard() just sets:
          data = null;
for in memory case. Even if we call mapOutput.discard() before
ramManager.unreserve(), we don't know when GC would kick in and make more
memory available.
Of course, given the large number of map outputs in your system, the root
cause from my reasoning is all the more likely to have made the OOME happen
sooner.

Thanks

>
On Sun, Mar 7, 2010 at 1:03 PM, Andy Sautins <an...@returnpath.net>wrote:

>
>   Ted,
>
>   I'm trying to follow the logic in your mail and I'm not sure I'm
> following.  If you would mind helping me understand I would appreciate it.
>
>   Looking at the code maxSingleShuffleLimit is only used in determining if
> the copy _can_ fit into memory:
>
>     boolean canFitInMemory(long requestedSize) {
>        return (requestedSize < Integer.MAX_VALUE &&
>                requestedSize < maxSingleShuffleLimit);
>      }
>
>    It also looks like the RamManager.reserve should wait until memory is
> available so it should hit a memory limit for that reason.
>
>    What does seem a little strange to me is the following ( ReduceTask.java
> starting at 2730 ):
>
>          // Inform the ram-manager
>          ramManager.closeInMemoryFile(mapOutputLength);
>          ramManager.unreserve(mapOutputLength);
>
>          // Discard the map-output
>          try {
>            mapOutput.discard();
>          } catch (IOException ignored) {
>            LOG.info("Failed to discard map-output from " +
>                     mapOutputLoc.getTaskAttemptId(), ignored);
>          }
>          mapOutput = null;
>
>   So to me that looks like the ramManager unreserves the memory before the
> mapOutput is discarded.  Shouldn't the mapOutput be discarded _before_ the
> ramManager unreserves the memory?  If the memory is unreserved before the
> actual underlying data references are removed then it seems like another
> thread can try to allocate memory ( ReduceTask.java:2730 ) before the
> previous memory is disposed ( mapOutput.discard() ).
>
>   Not sure that makes sense.  One thing to note is that the particular job
> that is failing does have a good number ( 200k+ ) of map outputs.  The large
> number of small map outputs may be why we are triggering a problem.
>
>   Thanks again for your thoughts.
>
>   Andy
>
>
> -----Original Message-----
> From: Jacob R Rideout [mailto:apache@jacobrideout.net]
> Sent: Sunday, March 07, 2010 1:21 PM
> To: common-user@hadoop.apache.org
> Cc: Andy Sautins; Ted Yu
> Subject: Re: Shuffle In Memory OutOfMemoryError
>
> Ted,
>
> Thank you. I filled MAPREDUCE-1571 to cover this issue. I might have
> some time to write a patch later this week.
>
> Jacob Rideout
>
> On Sat, Mar 6, 2010 at 11:37 PM, Ted Yu <yu...@gmail.com> wrote:
> > I think there is mismatch (in ReduceTask.java) between:
> >      this.numCopiers = conf.getInt("mapred.reduce.parallel.copies", 5);
> > and:
> >        maxSingleShuffleLimit = (long)(maxSize *
> > MAX_SINGLE_SHUFFLE_SEGMENT_FRACTION);
> > where MAX_SINGLE_SHUFFLE_SEGMENT_FRACTION is 0.25f
> >
> > because
> >      copiers = new ArrayList<MapOutputCopier>(numCopiers);
> > so the total memory allocated for in-mem shuffle is 1.25 * maxSize
> >
> > A JIRA should be filed to correlate the constant 5 above and
> > MAX_SINGLE_SHUFFLE_SEGMENT_FRACTION.
> >
> > Cheers
> >
> > On Sat, Mar 6, 2010 at 8:31 AM, Jacob R Rideout <apache@jacobrideout.net
> >wrote:
> >
> >> Hi all,
> >>
> >> We are seeing the following error in our reducers of a particular job:
> >>
> >> Error: java.lang.OutOfMemoryError: Java heap space
> >>        at
> >>
> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.shuffleInMemory(ReduceTask.java:1508)
> >>        at
> >>
> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1408)
> >>        at
> >>
> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1261)
> >>        at
> >>
> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1195)
> >>
> >>
> >> After enough reducers fail the entire job fails. This error occurs
> >> regardless of whether mapred.compress.map.output is true. We were able
> >> to avoid the issue by reducing mapred.job.shuffle.input.buffer.percent
> >> to 20%. Shouldn't the framework via ShuffleRamManager.canFitInMemory
> >> and.ShuffleRamManager.reserve correctly detect the the memory
> >> available for allocation? I would think that with poor configuration
> >> settings (and default settings in particular) the job may not be as
> >> efficient, but wouldn't die.
> >>
> >> Here is some more context in the logs, I have attached the full
> >> reducer log here: http://gist.github.com/323746
> >>
> >>
> >> 2010-03-06 07:54:49,621 INFO org.apache.hadoop.mapred.ReduceTask:
> >> Shuffling 4191933 bytes (435311 raw bytes) into RAM from
> >> attempt_201003060739_0002_m_000061_0
> >> 2010-03-06 07:54:50,222 INFO org.apache.hadoop.mapred.ReduceTask: Task
> >> attempt_201003060739_0002_r_000000_0: Failed fetch #1 from
> >> attempt_201003060739_0002_m_000202_0
> >> 2010-03-06 07:54:50,223 WARN org.apache.hadoop.mapred.ReduceTask:
> >> attempt_201003060739_0002_r_000000_0 adding host
> >> hd37.dfs.returnpath.net to penalty box, next contact in 4 seconds
> >> 2010-03-06 07:54:50,223 INFO org.apache.hadoop.mapred.ReduceTask:
> >> attempt_201003060739_0002_r_000000_0: Got 1 map-outputs from previous
> >> failures
> >> 2010-03-06 07:54:50,223 FATAL org.apache.hadoop.mapred.TaskRunner:
> >> attempt_201003060739_0002_r_000000_0 : Map output copy failure :
> >> java.lang.OutOfMemoryError: Java heap space
> >>        at
> >>
> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.shuffleInMemory(ReduceTask.java:1508)
> >>        at
> >>
> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1408)
> >>        at
> >>
> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1261)
> >>        at
> >>
> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1195)
> >>
> >>
> >> We tried this both in 0.20.1 and 0.20.2. We had hoped MAPREDUCE-1182
> >> would address the issue in 0.20.2, but it did not. Does anyone have
> >> any comments or suggestions? Is this a bug I should file a JIRA for?
> >>
> >> Jacob Rideout
> >> Return Path
> >>
> >
>

RE: Shuffle In Memory OutOfMemoryError

Posted by Andy Sautins <an...@returnpath.net>.
   Ted,

   I'm trying to follow the logic in your mail, but I'm not sure I'm following it.  If you wouldn't mind helping me understand, I would appreciate it.

   Looking at the code, maxSingleShuffleLimit is only used to determine whether the copy _can_ fit into memory:

     boolean canFitInMemory(long requestedSize) {
        return (requestedSize < Integer.MAX_VALUE &&
                requestedSize < maxSingleShuffleLimit);
      }

    It also looks like RamManager.reserve should wait until memory is available, so it shouldn't hit a memory limit for that reason.

    What does seem a little strange to me is the following ( ReduceTask.java starting at 2730 ):

          // Inform the ram-manager
          ramManager.closeInMemoryFile(mapOutputLength);
          ramManager.unreserve(mapOutputLength);

          // Discard the map-output
          try {
            mapOutput.discard();
          } catch (IOException ignored) {
            LOG.info("Failed to discard map-output from " +
                     mapOutputLoc.getTaskAttemptId(), ignored);
          }
          mapOutput = null;

   So to me that looks like the ramManager unreserves the memory before the mapOutput is discarded.  Shouldn't the mapOutput be discarded _before_ the ramManager unreserves the memory?  If the memory is unreserved before the actual underlying data references are removed then it seems like another thread can try to allocate memory ( ReduceTask.java:2730 ) before the previous memory is disposed ( mapOutput.discard() ).  
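
   For clarity, here is the reordering being suggested, sketched against the
same snippet (this is not the shipped 0.20.x code, and as Ted notes in his
reply, GC timing may still delay when the memory actually becomes
reclaimable):

    // Sketch of the suggested ordering: drop the reference to the in-memory
    // data first, then release the reservation, so no other copier can
    // reserve the space while the old byte[] is still reachable.
    try {
      mapOutput.discard();            // in-memory case: this nulls out 'data'
    } catch (IOException ignored) {
      LOG.info("Failed to discard map-output from " +
               mapOutputLoc.getTaskAttemptId(), ignored);
    }
    mapOutput = null;

    // Only now inform the ram-manager that the bytes are free.
    ramManager.closeInMemoryFile(mapOutputLength);
    ramManager.unreserve(mapOutputLength);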

   Not sure that makes sense.  One thing to note is that the particular job that is failing does have a good number ( 200k+ ) of map outputs.  The large number of small map outputs may be why we are triggering a problem.

   Thanks again for your thoughts.

   Andy


-----Original Message-----
From: Jacob R Rideout [mailto:apache@jacobrideout.net] 
Sent: Sunday, March 07, 2010 1:21 PM
To: common-user@hadoop.apache.org
Cc: Andy Sautins; Ted Yu
Subject: Re: Shuffle In Memory OutOfMemoryError

Ted,

Thank you. I filled MAPREDUCE-1571 to cover this issue. I might have
some time to write a patch later this week.

Jacob Rideout

On Sat, Mar 6, 2010 at 11:37 PM, Ted Yu <yu...@gmail.com> wrote:
> I think there is mismatch (in ReduceTask.java) between:
>      this.numCopiers = conf.getInt("mapred.reduce.parallel.copies", 5);
> and:
>        maxSingleShuffleLimit = (long)(maxSize *
> MAX_SINGLE_SHUFFLE_SEGMENT_FRACTION);
> where MAX_SINGLE_SHUFFLE_SEGMENT_FRACTION is 0.25f
>
> because
>      copiers = new ArrayList<MapOutputCopier>(numCopiers);
> so the total memory allocated for in-mem shuffle is 1.25 * maxSize
>
> A JIRA should be filed to correlate the constant 5 above and
> MAX_SINGLE_SHUFFLE_SEGMENT_FRACTION.
>
> Cheers
>
> On Sat, Mar 6, 2010 at 8:31 AM, Jacob R Rideout <ap...@jacobrideout.net>wrote:
>
>> Hi all,
>>
>> We are seeing the following error in our reducers of a particular job:
>>
>> Error: java.lang.OutOfMemoryError: Java heap space
>>        at
>> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.shuffleInMemory(ReduceTask.java:1508)
>>        at
>> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1408)
>>        at
>> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1261)
>>        at
>> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1195)
>>
>>
>> After enough reducers fail the entire job fails. This error occurs
>> regardless of whether mapred.compress.map.output is true. We were able
>> to avoid the issue by reducing mapred.job.shuffle.input.buffer.percent
>> to 20%. Shouldn't the framework via ShuffleRamManager.canFitInMemory
>> and.ShuffleRamManager.reserve correctly detect the the memory
>> available for allocation? I would think that with poor configuration
>> settings (and default settings in particular) the job may not be as
>> efficient, but wouldn't die.
>>
>> Here is some more context in the logs, I have attached the full
>> reducer log here: http://gist.github.com/323746
>>
>>
>> 2010-03-06 07:54:49,621 INFO org.apache.hadoop.mapred.ReduceTask:
>> Shuffling 4191933 bytes (435311 raw bytes) into RAM from
>> attempt_201003060739_0002_m_000061_0
>> 2010-03-06 07:54:50,222 INFO org.apache.hadoop.mapred.ReduceTask: Task
>> attempt_201003060739_0002_r_000000_0: Failed fetch #1 from
>> attempt_201003060739_0002_m_000202_0
>> 2010-03-06 07:54:50,223 WARN org.apache.hadoop.mapred.ReduceTask:
>> attempt_201003060739_0002_r_000000_0 adding host
>> hd37.dfs.returnpath.net to penalty box, next contact in 4 seconds
>> 2010-03-06 07:54:50,223 INFO org.apache.hadoop.mapred.ReduceTask:
>> attempt_201003060739_0002_r_000000_0: Got 1 map-outputs from previous
>> failures
>> 2010-03-06 07:54:50,223 FATAL org.apache.hadoop.mapred.TaskRunner:
>> attempt_201003060739_0002_r_000000_0 : Map output copy failure :
>> java.lang.OutOfMemoryError: Java heap space
>>        at
>> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.shuffleInMemory(ReduceTask.java:1508)
>>        at
>> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1408)
>>        at
>> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1261)
>>        at
>> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1195)
>>
>>
>> We tried this both in 0.20.1 and 0.20.2. We had hoped MAPREDUCE-1182
>> would address the issue in 0.20.2, but it did not. Does anyone have
>> any comments or suggestions? Is this a bug I should file a JIRA for?
>>
>> Jacob Rideout
>> Return Path
>>
>

Re: Shuffle In Memory OutOfMemoryError

Posted by Jacob R Rideout <ap...@jacobrideout.net>.
Ted,

Thank you. I filed MAPREDUCE-1571 to cover this issue. I might have
some time to write a patch later this week.

Jacob Rideout

On Sat, Mar 6, 2010 at 11:37 PM, Ted Yu <yu...@gmail.com> wrote:
> I think there is mismatch (in ReduceTask.java) between:
>      this.numCopiers = conf.getInt("mapred.reduce.parallel.copies", 5);
> and:
>        maxSingleShuffleLimit = (long)(maxSize *
> MAX_SINGLE_SHUFFLE_SEGMENT_FRACTION);
> where MAX_SINGLE_SHUFFLE_SEGMENT_FRACTION is 0.25f
>
> because
>      copiers = new ArrayList<MapOutputCopier>(numCopiers);
> so the total memory allocated for in-mem shuffle is 1.25 * maxSize
>
> A JIRA should be filed to correlate the constant 5 above and
> MAX_SINGLE_SHUFFLE_SEGMENT_FRACTION.
>
> Cheers
>
> On Sat, Mar 6, 2010 at 8:31 AM, Jacob R Rideout <ap...@jacobrideout.net>wrote:
>
>> Hi all,
>>
>> We are seeing the following error in our reducers of a particular job:
>>
>> Error: java.lang.OutOfMemoryError: Java heap space
>>        at
>> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.shuffleInMemory(ReduceTask.java:1508)
>>        at
>> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1408)
>>        at
>> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1261)
>>        at
>> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1195)
>>
>>
>> After enough reducers fail the entire job fails. This error occurs
>> regardless of whether mapred.compress.map.output is true. We were able
>> to avoid the issue by reducing mapred.job.shuffle.input.buffer.percent
>> to 20%. Shouldn't the framework via ShuffleRamManager.canFitInMemory
>> and.ShuffleRamManager.reserve correctly detect the the memory
>> available for allocation? I would think that with poor configuration
>> settings (and default settings in particular) the job may not be as
>> efficient, but wouldn't die.
>>
>> Here is some more context in the logs, I have attached the full
>> reducer log here: http://gist.github.com/323746
>>
>>
>> 2010-03-06 07:54:49,621 INFO org.apache.hadoop.mapred.ReduceTask:
>> Shuffling 4191933 bytes (435311 raw bytes) into RAM from
>> attempt_201003060739_0002_m_000061_0
>> 2010-03-06 07:54:50,222 INFO org.apache.hadoop.mapred.ReduceTask: Task
>> attempt_201003060739_0002_r_000000_0: Failed fetch #1 from
>> attempt_201003060739_0002_m_000202_0
>> 2010-03-06 07:54:50,223 WARN org.apache.hadoop.mapred.ReduceTask:
>> attempt_201003060739_0002_r_000000_0 adding host
>> hd37.dfs.returnpath.net to penalty box, next contact in 4 seconds
>> 2010-03-06 07:54:50,223 INFO org.apache.hadoop.mapred.ReduceTask:
>> attempt_201003060739_0002_r_000000_0: Got 1 map-outputs from previous
>> failures
>> 2010-03-06 07:54:50,223 FATAL org.apache.hadoop.mapred.TaskRunner:
>> attempt_201003060739_0002_r_000000_0 : Map output copy failure :
>> java.lang.OutOfMemoryError: Java heap space
>>        at
>> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.shuffleInMemory(ReduceTask.java:1508)
>>        at
>> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1408)
>>        at
>> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1261)
>>        at
>> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1195)
>>
>>
>> We tried this both in 0.20.1 and 0.20.2. We had hoped MAPREDUCE-1182
>> would address the issue in 0.20.2, but it did not. Does anyone have
>> any comments or suggestions? Is this a bug I should file a JIRA for?
>>
>> Jacob Rideout
>> Return Path
>>
>

Re: Shuffle In Memory OutOfMemoryError

Posted by Ted Yu <yu...@gmail.com>.
I think there is a mismatch (in ReduceTask.java) between:
      this.numCopiers = conf.getInt("mapred.reduce.parallel.copies", 5);
and:
        maxSingleShuffleLimit = (long)(maxSize *
MAX_SINGLE_SHUFFLE_SEGMENT_FRACTION);
where MAX_SINGLE_SHUFFLE_SEGMENT_FRACTION is 0.25f

because
      copiers = new ArrayList<MapOutputCopier>(numCopiers);
so the total memory allocated for in-mem shuffle is 1.25 * maxSize

A JIRA should be filed to correlate the constant 5 above and
MAX_SINGLE_SHUFFLE_SEGMENT_FRACTION.
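
One possible shape of such a correlation, sketched against the same fields
(illustrative only, not an actual patch): cap the per-copy limit so that
numCopiers concurrent reservations cannot exceed the overall in-memory
shuffle budget.

    // Illustrative only: with 5 copiers this caps each copy at maxSize / 5
    // instead of maxSize * 0.25, so 5 concurrent reservations <= maxSize.
    int numCopiers = conf.getInt("mapred.reduce.parallel.copies", 5);
    maxSingleShuffleLimit = (long) (maxSize *
        Math.min(MAX_SINGLE_SHUFFLE_SEGMENT_FRACTION, 1.0f / numCopiers));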

Cheers

On Sat, Mar 6, 2010 at 8:31 AM, Jacob R Rideout <ap...@jacobrideout.net>wrote:

> Hi all,
>
> We are seeing the following error in our reducers of a particular job:
>
> Error: java.lang.OutOfMemoryError: Java heap space
>        at
> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.shuffleInMemory(ReduceTask.java:1508)
>        at
> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1408)
>        at
> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1261)
>        at
> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1195)
>
>
> After enough reducers fail the entire job fails. This error occurs
> regardless of whether mapred.compress.map.output is true. We were able
> to avoid the issue by reducing mapred.job.shuffle.input.buffer.percent
> to 20%. Shouldn't the framework via ShuffleRamManager.canFitInMemory
> and.ShuffleRamManager.reserve correctly detect the the memory
> available for allocation? I would think that with poor configuration
> settings (and default settings in particular) the job may not be as
> efficient, but wouldn't die.
>
> Here is some more context in the logs, I have attached the full
> reducer log here: http://gist.github.com/323746
>
>
> 2010-03-06 07:54:49,621 INFO org.apache.hadoop.mapred.ReduceTask:
> Shuffling 4191933 bytes (435311 raw bytes) into RAM from
> attempt_201003060739_0002_m_000061_0
> 2010-03-06 07:54:50,222 INFO org.apache.hadoop.mapred.ReduceTask: Task
> attempt_201003060739_0002_r_000000_0: Failed fetch #1 from
> attempt_201003060739_0002_m_000202_0
> 2010-03-06 07:54:50,223 WARN org.apache.hadoop.mapred.ReduceTask:
> attempt_201003060739_0002_r_000000_0 adding host
> hd37.dfs.returnpath.net to penalty box, next contact in 4 seconds
> 2010-03-06 07:54:50,223 INFO org.apache.hadoop.mapred.ReduceTask:
> attempt_201003060739_0002_r_000000_0: Got 1 map-outputs from previous
> failures
> 2010-03-06 07:54:50,223 FATAL org.apache.hadoop.mapred.TaskRunner:
> attempt_201003060739_0002_r_000000_0 : Map output copy failure :
> java.lang.OutOfMemoryError: Java heap space
>        at
> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.shuffleInMemory(ReduceTask.java:1508)
>        at
> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1408)
>        at
> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1261)
>        at
> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1195)
>
>
> We tried this both in 0.20.1 and 0.20.2. We had hoped MAPREDUCE-1182
> would address the issue in 0.20.2, but it did not. Does anyone have
> any comments or suggestions? Is this a bug I should file a JIRA for?
>
> Jacob Rideout
> Return Path
>