Posted to common-commits@hadoop.apache.org by Apache Wiki <wi...@apache.org> on 2011/05/10 09:11:32 UTC
[Hadoop Wiki] Update of "NextGenMapReduceDevTesting" by Arun C Murthy
Dear Wiki user,
You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.
The "NextGenMapReduceDevTesting" page has been changed by Arun C Murthy.
The comment on this change is: v1.
http://wiki.apache.org/hadoop/NextGenMapReduceDevTesting
--------------------------------------------------
New page:
This wiki tracks developer-testing for NextGenMapReduce.
The aim of this document is to capture the various failure-handling scenarios for !MapReduce applications running under YARN, and for the YARN framework itself.
=== Failure scenarios ===
==== User task error ====
|| '''Corrective measures''' || '''Developer(s) verifying the corrective measures''' || '''Date(s)''' ||
|| RM is immediately notified of error by NM with appropriate error code/status-msg || || ||
|| !CapacityScheduler releases resources for queue, user and application || || ||
|| RM notifies AM about status (including error code) of the container || || ||
|| AM fails the task attempt || || ||
|| AM re-runs the task attempt on a _different node_, ahead of other 'virgin' (not yet attempted) tasks || || ||
==== User task error, same task fails 4 times ====
|| '''Corrective measures''' || '''Developer(s) verifying the corrective measures''' || '''Date(s)''' ||
|| RM is immediately notified of error by NM with appropriate error code/status-msg || || ||
|| !CapacityScheduler releases resources for queue, user and application || || ||
|| RM notifies AM about status (including error code) of the container || || ||
|| AM fails the task attempt || || ||
|| AM re-runs the task attempt on a _different node_, ahead of other 'virgin' (not yet attempted) tasks || || ||
|| AM fails the !MapReduce job and exits || || ||
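
The retry behaviour listed in the table above can be sketched as follows. This is a minimal illustration, not the AM's actual code: `Task`, `on_attempt_failed` and `MAX_ATTEMPTS` are hypothetical names, the limit of 4 is taken from the scenario heading, and preferring nodes where the task has not yet failed is an assumed reading of "different node".

```python
from dataclasses import dataclass, field

MAX_ATTEMPTS = 4  # assumption: taken from the "fails 4 times" scenario heading


@dataclass
class Task:
    """Hypothetical bookkeeping record for one map or reduce task."""
    task_id: str
    failed_nodes: list = field(default_factory=list)  # nodes where an attempt failed

    @property
    def failed_attempts(self):
        return len(self.failed_nodes)


def on_attempt_failed(task, node, all_nodes):
    """Record a failed attempt and decide what to do next.

    Returns ("FAIL_JOB", None) once the task has failed MAX_ATTEMPTS times,
    otherwise ("RETRY", candidate_nodes).
    """
    task.failed_nodes.append(node)
    if task.failed_attempts >= MAX_ATTEMPTS:
        return ("FAIL_JOB", None)
    # Prefer nodes on which this task has not failed yet; if there are none,
    # fall back to the full node list rather than stalling the task.
    candidates = [n for n in all_nodes if n not in task.failed_nodes]
    return ("RETRY", candidates or list(all_nodes))
```

A developer verifying this scenario would expect three "RETRY" decisions followed by one "FAIL_JOB" for a task whose attempts fail on four nodes in a row.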
==== Container failure ====
===== Localization error =====
|| '''Corrective measures''' || '''Developer(s) verifying the corrective measures''' || '''Date(s)''' ||
|| RM is immediately notified of error by NM with appropriate error code/status-msg || || ||
|| !CapacityScheduler releases resources for queue, user and application || || ||
|| RM notifies AM about status (including error code) of the container || || ||
|| AM fails the task attempt || || ||
|| AM re-runs the task attempt on a _different node_, ahead of other 'virgin' (not yet attempted) tasks || || ||
===== Exceeding memory or disk limits =====
|| '''Corrective measures''' || '''Developer(s) verifying the corrective measures''' || '''Date(s)''' ||
|| RM is immediately notified of error by NM with appropriate error code/status-msg || || ||
|| !CapacityScheduler releases resources for queue, user and application || || ||
|| RM notifies AM about status (including error code) of the container || || ||
|| AM fails the task attempt || || ||
|| AM re-runs the task attempt on a _different node_, ahead of other 'virgin' (not yet attempted) tasks || || ||
===== Lost map output or faulty NM Netty =====
|| '''Corrective measures''' || '''Developer(s) verifying the corrective measures''' || '''Date(s)''' ||
|| Reduces report shuffle failure errors to AM || || ||
|| On sufficient fetch-failure notifications the AM re-runs map || || ||
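
The fetch-failure handling above might look roughly like the sketch below. The page only says "sufficient" notifications, so the threshold value, the choice to count distinct reducers rather than raw reports, and the names `FetchFailureTracker` and `FETCH_FAILURE_THRESHOLD` are all assumptions made for illustration.

```python
from collections import defaultdict

FETCH_FAILURE_THRESHOLD = 3  # hypothetical value; the page only says "sufficient"


class FetchFailureTracker:
    """Counts shuffle fetch-failure reports per completed map attempt."""

    def __init__(self, threshold=FETCH_FAILURE_THRESHOLD):
        self.threshold = threshold
        # map attempt id -> set of reducers that reported a fetch failure
        self.reports = defaultdict(set)

    def report(self, map_attempt, reducer):
        """Record one report; return True when the map should be re-run."""
        self.reports[map_attempt].add(reducer)
        if len(self.reports[map_attempt]) >= self.threshold:
            # Forget the old reports so the re-run attempt starts clean.
            del self.reports[map_attempt]
            return True
        return False
```

Counting distinct reducers keeps a single misbehaving reduce from forcing a map re-run on its own; whether the real AM does this is not stated on this page.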
===== User fails/kills map or reduce task =====
|| '''Corrective measures''' || '''Developer(s) verifying the corrective measures''' || '''Date(s)''' ||
|| RM is immediately notified of error by NM with appropriate error code/status-msg || || ||
|| !CapacityScheduler releases resources for queue, user and application || || ||
|| RM notifies AM about status (including error code) of the container || || ||
|| AM fails the task attempt || || ||
|| AM re-runs the task attempt on a _different node_, ahead of other 'virgin' (not yet attempted) tasks || || ||
==== Node failure due to timeout or health-check error ====
|| '''Corrective measures''' || '''Developer(s) verifying the corrective measures''' || '''Date(s)''' ||
|| RM fails all running containers and informs appropriate AMs || || ||
|| Shuffle failures for completed map containers... handled (aggressively?) by AM || || ||
|| AM re-runs running task-attempts and completed maps || || ||
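
The last row above can be read as a selection rule over the attempts that ran on the lost node. The sketch below assumes a simple dictionary record per attempt; the field names (`id`, `node`, `type`, `state`) and the function name are hypothetical.

```python
def attempts_to_rerun(attempts, lost_node):
    """Return ids of attempts the AM would re-run after losing `lost_node`.

    Selected: attempts still RUNNING on the node, plus completed MAP attempts,
    whose output lived on that node and can no longer be fetched.
    """
    rerun = []
    for a in attempts:
        if a["node"] != lost_node:
            continue
        is_running = a["state"] == "RUNNING"
        is_completed_map = a["state"] == "SUCCEEDED" and a["type"] == "MAP"
        if is_running or is_completed_map:
            rerun.append(a["id"])
    return rerun
```

Completed reduce attempts on the lost node are left alone here, matching the table, which lists only running attempts and completed maps.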
==== !MapReduce AM failure ====
|| '''Corrective measures''' || '''Developer(s) verifying the corrective measures''' || '''Date(s)''' ||
|| NM notifies RM || || ||
|| !CapacityScheduler releases resources for queue, user and application || || ||
|| ASM recognises AM failure || || ||
|| ASM kills all running containers || || ||
|| ASM restarts !MapReduce AM || || ||
|| !MapReduce AM recovers and re-runs only non-complete tasks || || ||
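
The recovery step in the last row can be illustrated as below. It assumes the restarted AM has access to some durable record of which tasks completed; the page does not say how that record is kept, and `tasks_to_rerun` and its arguments are hypothetical names.

```python
def tasks_to_rerun(all_tasks, completed_log):
    """Return the tasks, in their original order, with no completion record."""
    done = set(completed_log)
    return [t for t in all_tasks if t not in done]
```

For a job with tasks `m0`, `m1` and `r0` where only `m1` was recorded as complete before the AM failed, the restarted AM would schedule `m0` and `r0` again.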
==== !ResourceManager bounce ====
|| '''Corrective measures''' || '''Developer(s) verifying the corrective measures''' || '''Date(s)''' ||
|| RM recovers all running AMs || || ||
|| RM recovers all running containers || || ||
|| RM rebuilds !CapacityScheduler queue & user capacities || || ||
|| !MapReduce AMs re-run only non-complete tasks || || ||