Posted to mapreduce-issues@hadoop.apache.org by "Esteban Gutierrez (JIRA)" <ji...@apache.org> on 2011/06/14 20:58:47 UTC

[jira] [Commented] (MAPREDUCE-2592) TT should fail task immediately if userlog dir cannot be created

    [ https://issues.apache.org/jira/browse/MAPREDUCE-2592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13049336#comment-13049336 ] 

Esteban Gutierrez commented on MAPREDUCE-2592:
----------------------------------------------

The problem propagates very quickly to all the nodes once a single TaskTracker has reached that state and more jobs are submitted. It can bring down the whole cluster, since all the TTs will end up blacklisted.

A sample stacktrace:

11/02/05 10:00:01 WARN mapred.JobClient: Error reading task outputhttp://dn:50060/tasklog?plaintext=true&taskid=attempt_201102050901_1000_m_000001_0&filter=stderr 
11/02/05 10:00:02 INFO mapred.JobClient: Task Id : attempt_201102050901_1000_m_000001_0, Status : FAILED 
java.lang.Throwable: Child Error 
at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:471) 
Caused by: java.io.IOException: Task process exit with nonzero status of 1. 
at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:458)



> TT should fail task immediately if userlog dir cannot be created
> ----------------------------------------------------------------
>
>                 Key: MAPREDUCE-2592
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2592
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: tasktracker
>    Affects Versions: 0.23.0
>            Reporter: Todd Lipcon
>             Fix For: 0.23.0
>
>
> Currently, TaskRunner will log the message "mkdirs failed. Ignoring" if it fails to mkdir the userlog directory for a task. It then goes on to spawn taskjvm.sh, which tries to redirect output into the userlogs dir and therefore fails with exit code 1. This leads to error messages that are very hard to diagnose ("task failed with exit status 1") in cases where the userlog directory has either become inaccessible or has reached the maximum number of dirents (32000 in ext3).
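
For illustration only, the fail-fast behavior requested in the description above could look roughly like the sketch below. This is not the actual TaskRunner code or the eventual patch: the class UserlogDirCheck and the method ensureUserlogDir are hypothetical names, and the directory layout (one directory per attempt id under a userlog root) is an assumption. The point is simply to throw as soon as mkdirs fails, with a message that names the directory, instead of logging "Ignoring" and letting the child JVM die later with exit status 1.

```java
import java.io.File;
import java.io.IOException;

public class UserlogDirCheck {

    /**
     * Hypothetical helper (not part of Hadoop): create the per-attempt
     * userlog directory, or throw instead of logging and carrying on.
     */
    static File ensureUserlogDir(File userlogRoot, String attemptId) throws IOException {
        File dir = new File(userlogRoot, attemptId);
        // mkdirs() returns false both when creation fails and when the
        // directory already exists, so check isDirectory() before failing.
        if (!dir.mkdirs() && !dir.isDirectory()) {
            throw new IOException("Cannot create userlog dir " + dir
                + " (parent inaccessible, or directory entry limit reached?)");
        }
        return dir;
    }

    public static void main(String[] args) throws IOException {
        // Demo root under the JVM temp dir; a real TT would use its own log dir.
        File root = new File(System.getProperty("java.io.tmpdir"), "userlogs-demo");
        File dir = ensureUserlogDir(root, "attempt_201102050901_1000_m_000001_0");
        System.out.println("userlog dir ready: " + dir);
    }
}
```

With a check like this, the task attempt would fail with an IOException that points at the userlog directory, which is much easier to diagnose than a bare "Child Error".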
