Posted to mapreduce-dev@hadoop.apache.org by "Greg Roelofs (JIRA)" <ji...@apache.org> on 2010/08/28 07:18:53 UTC
[jira] Created: (MAPREDUCE-2041) TaskRunner logDir race condition leads to crash on job-acl.xml creation
TaskRunner logDir race condition leads to crash on job-acl.xml creation
-----------------------------------------------------------------------
Key: MAPREDUCE-2041
URL: https://issues.apache.org/jira/browse/MAPREDUCE-2041
Project: Hadoop Map/Reduce
Issue Type: Bug
Components: task
Affects Versions: 0.22.0
Environment: Linux/x86-64, 32-bit Java, NFS source tree
Reporter: Greg Roelofs
TaskRunner's prepareLogFiles() logs a warning when mkdirs() fails but otherwise ignores the failure, and it does not check the return value of setPermissions() at all. Either call can fail (e.g., on NFS, where there appears to be a TOCTOU-style race, except with C = "creation"), in which case the subsequent creation of job-acl.xml in writeJobACLs() will also fail, killing the task:
{noformat}
2010-08-26 20:18:10,334 INFO mapred.TaskInProgress (TaskInProgress.java:updateStatus(591)) - Error from attempt_20100826201758813_0001_m_000001_0 on tracker_host2.rack.com:rh45-64/127.0.0.1:35112: java.lang.Throwable: Child Error
at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:229)
Caused by: java.io.FileNotFoundException: /home/<username>/grid/trunk/hadoop-mapreduce/build/test/logs/userlogs/job_20100826201758813_0001/attempt_20100826201758813_0001_m_000001_0/job-acl.xml (No such file or directory)
at java.io.FileOutputStream.open(Native Method)
at java.io.FileOutputStream.<init>(FileOutputStream.java:179)
at java.io.FileOutputStream.<init>(FileOutputStream.java:131)
at org.apache.hadoop.mapred.TaskRunner.writeJobACLs(TaskRunner.java:307)
at org.apache.hadoop.mapred.TaskRunner.prepareLogFiles(TaskRunner.java:290)
at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:199)
{noformat}
This in turn causes TestTrackerBlacklistAcrossJobs to fail sporadically. The job-acl.xml failure always seems to affect host2, and to do so more quickly than the intentional exception on host1, so the wrong host is job-blacklisted and the test's assertion fails.
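
For illustration only, here is a minimal sketch of how the two unchecked calls described above might be guarded. It is not the actual TaskRunner code or a proposed patch: the class LogDirSketch and the helpers ensureLogDir() and grantOwnerAccess() are hypothetical names, and it uses plain java.io.File calls rather than whatever setPermissions() does internally.

```java
import java.io.File;
import java.io.IOException;

// Hypothetical sketch; names are assumptions, not Hadoop APIs.
public class LogDirSketch {

    /**
     * Creates logDir (and any missing parents) and throws if the directory
     * still does not exist afterwards.
     */
    static void ensureLogDir(File logDir) throws IOException {
        // mkdirs() returns false both on a real failure and when the directory
        // already exists (e.g., someone else created it first), so re-check
        // isDirectory() before treating the result as an error.
        if (!logDir.mkdirs() && !logDir.isDirectory()) {
            throw new IOException("Could not create log directory " + logDir);
        }
    }

    /**
     * Grants the owner read, write, and execute access on dir, and throws if
     * any of the three calls reports failure instead of silently continuing.
     */
    static void grantOwnerAccess(File dir) throws IOException {
        boolean ok = dir.setReadable(true, true);
        ok = dir.setWritable(true, true) && ok;
        ok = dir.setExecutable(true, true) && ok;
        if (!ok) {
            throw new IOException("Could not set permissions on " + dir);
        }
    }
}
```

With checks like these, a failure would surface at the directory-creation step with a message naming the directory, rather than later as a FileNotFoundException for job-acl.xml. Whether the task should fail at that point or retry is a separate design question that this sketch does not address.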