Posted to mapreduce-dev@hadoop.apache.org by "Greg Roelofs (JIRA)" <ji...@apache.org> on 2010/08/28 07:18:53 UTC

[jira] Created: (MAPREDUCE-2041) TaskRunner logDir race condition leads to crash on job-acl.xml creation

TaskRunner logDir race condition leads to crash on job-acl.xml creation
-----------------------------------------------------------------------

                 Key: MAPREDUCE-2041
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2041
             Project: Hadoop Map/Reduce
          Issue Type: Bug
          Components: task
    Affects Versions: 0.22.0
         Environment: Linux/x86-64, 32-bit Java, NFS source tree
            Reporter: Greg Roelofs


TaskRunner's prepareLogFiles() logs a warning when mkdirs() fails but otherwise ignores the failure, and it never checks the return value of setPermissions().  Either call can fail (e.g., on NFS, where there appears to be a TOCTOU-style race, except that the "C" stands for "creation" rather than "check"), in which case the subsequent creation of job-acl.xml in writeJobACLs() also fails, killing the task:

{noformat}
2010-08-26 20:18:10,334 INFO  mapred.TaskInProgress (TaskInProgress.java:updateStatus(591)) - Error from attempt_20100826201758813_0001_m_000001_0 on tracker_host2.rack.com:rh45-64/127.0.0.1:35112: java.lang.Throwable: Child Error
    at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:229)
Caused by: java.io.FileNotFoundException: /home/<username>/grid/trunk/hadoop-mapreduce/build/test/logs/userlogs/job_20100826201758813_0001/attempt_20100826201758813_0001_m_000001_0/job-acl.xml (No such file or directory)
    at java.io.FileOutputStream.open(Native Method)
    at java.io.FileOutputStream.<init>(FileOutputStream.java:179)
    at java.io.FileOutputStream.<init>(FileOutputStream.java:131)
    at org.apache.hadoop.mapred.TaskRunner.writeJobACLs(TaskRunner.java:307)
    at org.apache.hadoop.mapred.TaskRunner.prepareLogFiles(TaskRunner.java:290)
    at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:199)
{noformat}

The task failure in turn causes TestTrackerBlacklistAcrossJobs to fail sporadically.  The job-acl.xml failure always seems to affect host2, and it occurs sooner than the intentional exception on host1, so the wrong host is job-blacklisted and the test's assertion fails.
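
One possible way to harden the log-directory setup described above is sketched here.  This is not the actual TaskRunner code or a proposed patch: the class LogDirSketch and the helpers ensureLogDir() and writeAclFile() are hypothetical names, and the java.io.File permission setters are only a stand-in for whatever setPermissions() does internally.  The idea is to re-check the directory after a failed mkdirs() (which also returns false when the directory already exists) and to treat an unusable directory as an error before any attempt to create job-acl.xml.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

// Hypothetical sketch; names and structure are assumptions, not Hadoop code.
final class LogDirSketch {

    /** Creates logDir if needed and fails loudly if it cannot be used. */
    static File ensureLogDir(File logDir) throws IOException {
        // mkdirs() returns false both on a real failure and when the
        // directory already exists, so re-check before calling it an error.
        if (!logDir.mkdirs() && !logDir.isDirectory()) {
            throw new IOException("Could not create log directory " + logDir);
        }
        // Stand-in for the permission step: check every return value
        // instead of discarding it.
        boolean ok = logDir.setReadable(true, true)
                && logDir.setWritable(true, true)
                && logDir.setExecutable(true, true);
        if (!ok) {
            throw new IOException("Could not set permissions on " + logDir);
        }
        return logDir;
    }

    /** Writes a placeholder ACL file into a directory known to exist. */
    static File writeAclFile(File logDir, String contents) throws IOException {
        File aclFile = new File(logDir, "job-acl.xml");
        try (FileOutputStream out = new FileOutputStream(aclFile)) {
            out.write(contents.getBytes(StandardCharsets.UTF_8));
        }
        return aclFile;
    }

    public static void main(String[] args) throws IOException {
        // Use a throwaway temp directory rather than a real userlogs tree.
        File base = Files.createTempDirectory("userlogs").toFile();
        File logDir = ensureLogDir(new File(base, "job_0001/attempt_0001"));
        File acl = writeAclFile(logDir, "<configuration/>");
        System.out.println("wrote " + acl);
    }
}
```

With this shape, a directory that cannot be created or prepared surfaces as an IOException naming the directory, rather than as a later FileNotFoundException on job-acl.xml.  Whether a retry is also needed to cope with the NFS behavior is a separate question that this sketch does not address.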

