Posted to mapreduce-issues@hadoop.apache.org by "BELUGA BEHR (JIRA)" <ji...@apache.org> on 2019/02/06 03:29:00 UTC

[jira] [Comment Edited] (MAPREDUCE-7180) Relaunching Failed Containers

    [ https://issues.apache.org/jira/browse/MAPREDUCE-7180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16761409#comment-16761409 ] 

BELUGA BEHR edited comment on MAPREDUCE-7180 at 2/6/19 3:28 AM:
----------------------------------------------------------------

Thanks for the clarification, [~wilfreds]. I see where you are coming from.

How confident are we that the 80/20 split will cover all cases? Are there any stats to back this up? The most important case is an application failing with OOM: the Mapper or Reducer needs more heap, and enlarging the container size automatically increases the heap size under the 80/20 split.
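
To make the 80/20 arithmetic concrete, here is a minimal sketch. It assumes the heap is derived as roughly 80% of the container size, rounded up; I believe this corresponds to the mapreduce.job.heap.memory-mb.ratio setting with its 0.8 default, but treat that mapping as an assumption. The class and method names are hypothetical and not taken from the MapReduce code base.

```java
public class HeapSplitSketch {
    // Assumed split: 80% of the container goes to JVM heap, 20% is left for overhead.
    static final double HEAP_RATIO = 0.8;

    // Heap size in MB derived from the container size, rounded up.
    static int heapMb(int containerMb) {
        return (int) Math.ceil(containerMb * HEAP_RATIO);
    }

    public static void main(String[] args) {
        // Container sizes after zero, one, and two 50% increases from 1 GB.
        for (int containerMb : new int[] {1024, 1536, 2304}) {
            System.out.println(containerMb + " MB container -> -Xmx" + heapMb(containerMb) + "m");
        }
    }
}
```

Under that assumption, growing the container grows the heap in proportion, which is why the OOM case is covered. The non-heap overhead also only ever gets 20% of the container, which is where the split may not cover all cases.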


was (Author: belugabehr):
The two classes of errors are:

# Application fails with OOM - Mapper or Reducer needs more heap, enlarging the container size will automatically increase the heap size at 80/20 split
# Overhead size of application is large

The first one is the most common.  

> Relaunching Failed Containers
> -----------------------------
>
>                 Key: MAPREDUCE-7180
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-7180
>             Project: Hadoop Map/Reduce
>          Issue Type: New Feature
>          Components: mrv1, mrv2
>            Reporter: BELUGA BEHR
>            Priority: Major
>
> In my experience, it is very common that a MR job completely fails because a single Mapper/Reducer container is using more memory than has been reserved in YARN.  The following message is logged by the MapReduce ApplicationMaster:
> {code}
> Container [pid=46028,containerID=container_e54_1435155934213_16721_01_003666] is running beyond physical memory limits. 
> Current usage: 1.0 GB of 1 GB physical memory used; 2.7 GB of 2.1 GB virtual memory used. Killing container.
> {code}
> In this case, the container is re-launched on another node, and of course, it is killed again for the same reason.  This process happens three (maybe four?) times before the entire MapReduce job fails.  It's often said that the definition of insanity is doing the same thing over and over and expecting different results.
> For all intents and purposes, the amount of resources requested by Mappers and Reducers is a fixed amount, based on the default configuration values.  Users can set the memory on a per-job basis, but it's a pain, not exact, and requires intimate knowledge of the MapReduce framework and its memory usage patterns.
> I propose that if the MR ApplicationMaster detects that a container is killed because of this specific memory resource constraint, that it requests a larger container for the subsequent task attempt.
> For example, increase the requested memory size by 50% each time the container fails and the task is retried.  This will prevent many Job failures and allow for additional memory tuning, per-Job, after the fact, to get better performance (vs. fail/succeed).
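
The retry policy proposed above could be sketched roughly as follows. This is an illustration only, not code from the ApplicationMaster: the class, method, and parameter names are hypothetical, and the cap at a maximum allocation is an added assumption (the proposal does not mention one), on the reasoning that a request larger than the scheduler allows could never be granted.

```java
public class ContainerGrowthSketch {
    // Proposed growth: request 50% more memory after each memory-limit kill.
    static final double GROWTH_FACTOR = 1.5;

    // Next request size in MB, capped at an assumed scheduler maximum.
    static int nextRequestMb(int previousMb, int maxAllocationMb) {
        long grown = (long) Math.ceil(previousMb * GROWTH_FACTOR);
        return (int) Math.min(grown, (long) maxAllocationMb);
    }

    public static void main(String[] args) {
        int requestMb = 1024;        // hypothetical initial container size
        int maxAllocationMb = 8192;  // hypothetical scheduler maximum
        int maxAttempts = 4;         // hypothetical attempt limit

        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            System.out.println("attempt " + attempt + ": request " + requestMb + " MB");
            // Pretend every attempt is killed for exceeding its memory limit.
            requestMb = nextRequestMb(requestMb, maxAllocationMb);
        }
    }
}
```

With these numbers the four attempts would request 1024, 1536, 2304, and 3456 MB, rather than 1024 MB each time.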


