Posted to mapreduce-issues@hadoop.apache.org by "Kevin Beyer (JIRA)" <ji...@apache.org> on 2013/10/12 04:03:45 UTC

[jira] [Created] (MAPREDUCE-5580) OutOfMemoryError in ReduceTask shuffleInMemory

Kevin Beyer created MAPREDUCE-5580:
--------------------------------------

             Summary: OutOfMemoryError in ReduceTask shuffleInMemory
                 Key: MAPREDUCE-5580
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5580
             Project: Hadoop Map/Reduce
          Issue Type: Bug
          Components: task
    Affects Versions: 0.20.2
            Reporter: Kevin Beyer


I have had several reduce tasks fail during the shuffle phase with the following error and stack trace (on CDH 4.1.2):

Error: java.lang.OutOfMemoryError: Java heap space
	at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.shuffleInMemory(ReduceTask.java:1644)
	at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1504)
	at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1339)
	at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1271)

I found many web posts that report the same problem, as well as a prior Hadoop issue that is already fixed (that one involved an int overflow problem).

The task had 1 GB of Java heap, and the mapred.job.shuffle.input.buffer.percent parameter in mapred-site.xml was set to its default of 0.7.  This means that 1 GB * 0.7 = 717 MB of Java heap is available to hold map outputs, and only map outputs no bigger than 717 / 4 = 179 MB are kept in memory during the shuffle.
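
For concreteness, here is a rough sketch of that arithmetic. This is not the actual Hadoop code; the 0.25 single-segment fraction is my reading of where the "/ 4" comes from, and the names are purely illustrative:

  // Illustrative sketch of the shuffle memory budget described above.
  public class ShuffleBudgetSketch {
    public static void main(String[] args) {
      long maxHeap = 1024L * 1024 * 1024;      // 1 GB reduce task heap (-Xmx1g)
      double inputBufferPercent = 0.70;        // mapred.job.shuffle.input.buffer.percent
      double singleSegmentFraction = 0.25;     // assumed fraction behind the "/ 4"

      long shuffleBuffer = (long) (maxHeap * inputBufferPercent);
      long maxInMemoryMapOutput = (long) (shuffleBuffer * singleSegmentFraction);

      // Prints roughly 716 MB and 179 MB, matching the numbers above.
      System.out.println("shuffle buffer           ~ " + (shuffleBuffer >> 20) + " MB");
      System.out.println("max in-memory map output ~ " + (maxInMemoryMapOutput >> 20) + " MB");
    }
  }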

We were able to capture a heap dump of one reduce task.  The heap contained eight byte arrays of 127 MB each, every one referenced by its own DataInputBuffer.  Six of the buffers were referenced by the linked list in ReduceTask$ReduceCopier.mapOutputsFilesInMemory.  These six byte arrays consume 127 MB * 6 = 762 MB of the heap.  Curiously, this 762 MB exceeds the 717 MB limit.  ShuffleRamManager.fullSize was 797966777 bytes ≈ 761 MB, so something is a bit off in my original value of 717...  But this is not the major source of trouble.

There are two more large byte arrays, 127 MB * 2 = 254 MB in total, that are still in memory.  These are referenced from DataInputBuffers that are in turn referenced indirectly by the static Merger.MergeQueue instance.

One of these is referenced twice, by the 'key' and 'value' fields of the MergeQueue.  These fields store the current minimum key and value by pointing at the full byte array of a map output plus a range of a few bytes within that array.  They are needed during the active merge process, but not once the merge is complete.  In my heap dump, the 'segments' list had been cleared, so no active merge was in progress; however, 'key' and 'value' were still set from the last merge pass.  This pins one in-memory map output in memory, which with default settings can be as big as 0.7 / 4 = 17.5% of memory.  When a merge phase is complete, these two fields should be set to null.
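
A minimal sketch of that suggestion follows.  This is not the actual Merger.MergeQueue code; the class, the byte[] fields standing in for the DataInputBuffer 'key'/'value' fields, and the close() hook are all illustrative:

  import java.util.ArrayDeque;
  import java.util.Deque;

  // Illustrative sketch: 'key' and 'value' stand in for the DataInputBuffer fields
  // that point into a map output's backing byte array during a merge pass.
  public class MergeQueueSketch {
    private byte[] key;
    private byte[] value;
    private final Deque<byte[]> segments = new ArrayDeque<byte[]>();

    void add(byte[] segment) { segments.add(segment); }

    boolean next() {
      byte[] current = segments.poll();
      if (current == null) { return false; }
      key = current;     // in the real code these reference ranges within the output
      value = current;
      return true;
    }

    void close() {
      // The fix suggested above: drop the references when the merge completes,
      // instead of leaving them set from the last merge pass.
      key = null;
      value = null;
    }

    public static void main(String[] args) {
      MergeQueueSketch q = new MergeQueueSketch();
      q.add(new byte[8 << 20]);            // pretend 8 MB "map output"
      while (q.next()) { /* merge work would happen here */ }
      q.close();                           // without this, the last array stays reachable
    }
  }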

The second byte array is referenced via MergeQueue.comparator, a RawComparator.  In my case this is a WritableComparator, and the reference is most likely held because of this method:

  public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
    try {
      buffer.reset(b1, s1, l1);                   // parse key1
      key1.readFields(buffer);
      
      buffer.reset(b2, s2, l2);                   // parse key2
      key2.readFields(buffer);
      
    } catch (IOException e) {
      throw new RuntimeException(e);
    }
    
    return compare(key1, key2);                   // compare them
  }

This causes the comparator to remember the last 'b2' byte array passed into compare().  That byte array could be an in-memory map output, which by default can be as large as 0.7 / 4 = 17.5% of memory.  The code could use a finally { buffer.clear() } to drop the reference; alternatively, the API could include a reset() call to clear such unnecessary state.
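
As a sketch of the finally-based fix (not an actual patch): I do not believe DataInputBuffer has a clear() method, so the finally block below resets it against an empty array instead; EMPTY is a helper introduced here only for illustration.

  private static final byte[] EMPTY = new byte[0];  // illustrative helper

  public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
    try {
      buffer.reset(b1, s1, l1);                   // parse key1
      key1.readFields(buffer);

      buffer.reset(b2, s2, l2);                   // parse key2
      key2.readFields(buffer);
    } catch (IOException e) {
      throw new RuntimeException(e);
    } finally {
      buffer.reset(EMPTY, 0);                     // drop the reference to b2
    }

    return compare(key1, key2);                   // compare them
  }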

Given this information, we can see why we can easily cause an OOM error: by default, 70% of memory is dedicated to map outputs, and we can have 17.5% * 2 = 35% of memory unaccounted for by the two references described above.  Even without accounting for any other memory overhead, we already have 70% + 35% = 105% of memory occupied in the unlucky case that these two references point at the largest possible in-memory map outputs.

There may be other leakage of these byte arrays, but these were all of the large byte arrays in my heap dump.  A test that produces many map outputs, each 0.7 / 4 = 17.5% of the reduce task heap, can reliably recreate this problem and perhaps uncover other unaccounted-for large byte arrays.
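
One possible shape for such a test, sketched under the assumption of a 1 GB reduce heap with default shuffle settings; the class name, record sizes, and the use of the old mapred API are all mine, for illustration only:

  import java.io.IOException;
  import org.apache.hadoop.io.BytesWritable;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.MapReduceBase;
  import org.apache.hadoop.mapred.Mapper;
  import org.apache.hadoop.mapred.OutputCollector;
  import org.apache.hadoop.mapred.Reporter;

  // Illustrative reproducer: each map call emits ~170 MB, just under the ~179 MB
  // in-memory single-shuffle limit, so every fetched map output pins about 17.5%
  // of a 1 GB reduce heap.  Intended to run with one input record per map task
  // (e.g. via NLineInputFormat), several map tasks, and a single reducer.
  public class LargeOutputMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, LongWritable, BytesWritable> {

    private static final int RECORD_BYTES = 10 * 1024 * 1024;  // 10 MB per record, well under io.sort.mb
    private static final int RECORDS = 17;                     // ~170 MB of map output per call

    public void map(LongWritable key, Text value,
                    OutputCollector<LongWritable, BytesWritable> output,
                    Reporter reporter) throws IOException {
      byte[] payload = new byte[RECORD_BYTES];
      for (int i = 0; i < RECORDS; i++) {
        output.collect(new LongWritable(key.get() * RECORDS + i), new BytesWritable(payload));
      }
    }
  }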




--
This message was sent by Atlassian JIRA
(v6.1#6144)