Posted to mapreduce-issues@hadoop.apache.org by "John Elliott (JIRA)" <ji...@apache.org> on 2013/01/18 15:36:13 UTC

[jira] [Created] (MAPREDUCE-4947) Random task failures during TeraSort job

John Elliott created MAPREDUCE-4947:
---------------------------------------

             Summary: Random task failures during TeraSort job
                 Key: MAPREDUCE-4947
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4947
             Project: Hadoop Map/Reduce
          Issue Type: Bug
    Affects Versions: 1.0.1, 1.0.0, 0.20.205.0
         Environment: RHEL 6.2
4 datanodes
    one xfs filesystem per datanode
    2 quad core CPU's per datanode
    48 GB memory per datanode
10GbE node interconnect
jdk1.6.0_32
            Reporter: John Elliott
            Priority: Minor


During most of my terasort jobs, I see occasional, random map task failures during the reduce phase.  Usually there will be only 1-4 task failures during a job, with the job completing successfully.  On rare occasions, a tasktracker will be blacklisted.  Below are the usual error messages:
========================================
INFO mapred.JobClient: Task Id : attempt_201301151521_0002_m_005954_0, Status : FAILED
java.lang.Throwable: Child Error
        at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:271)
Caused by: java.io.IOException: Task process exit with nonzero status of 126.
        at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:258)
WARN mapred.JobClient: Error reading task output http://datanode3:50060/tasklog?plaintext=true&attemptid=attempt_201301151521_0002_m_005954_0&filter=stdout
WARN mapred.JobClient: Error reading task output http://datanode3:50060/tasklog?plaintext=true&attemptid=attempt_201301151521_0002_m_005954_0&filter=stderr
==========================================
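For context on the "nonzero status of 126" above: by POSIX shell convention, exit status 126 means a command was found but could not be executed (typically a permission or resource problem), which is one reason this error often points at the environment in which the tasktracker launches the child JVM rather than at the task code itself. A minimal illustration of that convention (the temp-file name and script content are just for demonstration):

```shell
# Demonstrate the POSIX convention behind exit status 126:
# a file that exists but lacks the execute bit yields 126 when invoked.
tmpfile=$(mktemp)               # created mode 600, so not executable
echo 'echo hello' > "$tmpfile"
"$tmpfile" 2>/dev/null          # shell finds it but cannot exec it
status=$?
echo "exit status: $status"     # prints 126
rm -f "$tmpfile"
```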
Each tasktracker node is configured with 8 map slots and 7 reduce slots, for a total of 32 map slots and 28 reduce slots across the 4-datanode cluster.

The problem never occurs during teragen jobs and only appears after the reduce copies start. Cutting the number of slots in half reduces the frequency, but the problem still occurs.


Actions taken without any success:
- ulimit increases for nproc and nofile to 32768 and then 65536
- setting MALLOC_ARENA_MAX=4 in hadoop-env.sh per HADOOP-7154
- heapsize increases and reductions
- reduction of map and reduce slots as stated above
- various modifications of mapreduce and hdfs properties
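For reference, the limit and allocator changes described above would correspond to entries like the following; the values are the ones tried in this report, and the `mapred` user name is an assumption about which account runs the tasktracker, not part of the original report:

```shell
# /etc/security/limits.conf -- raise per-user process and
# file-descriptor limits (tried 32768, then 65536)
mapred  soft  nproc   65536
mapred  hard  nproc   65536
mapred  soft  nofile  65536
mapred  hard  nofile  65536

# conf/hadoop-env.sh -- cap glibc malloc arenas per HADOOP-7154
export MALLOC_ARENA_MAX=4
```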

I've done quite a bit of testing with CDH3 on the same hardware and have not encountered this problem, so I suspect there may be a bug fix or patch I'm missing.  Any suggestions for further isolating the problem, or patches worth applying, would be much appreciated.

Thanks in advance!

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira