Posted to mapreduce-issues@hadoop.apache.org by "Owen O'Malley (JIRA)" <ji...@apache.org> on 2011/08/19 01:00:28 UTC

[jira] [Commented] (MAPREDUCE-2846) a small % of all tasks fail with DefaultTaskController

    [ https://issues.apache.org/jira/browse/MAPREDUCE-2846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13087374#comment-13087374 ] 

Owen O'Malley commented on MAPREDUCE-2846:
------------------------------------------

Offline, Allen gave me a stack trace:

{quote}
java.io.FileNotFoundException: File /export/apps/hadoop/hadoop-0.20.204.0/logs/userlogs/job_201108100052_0008/attempt_201108100052_0008_r_000145_0/log.tmp does not exist.
	at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:371)
	at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:210)
	at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:160)
	at org.apache.hadoop.fs.RawLocalFileSystem.rename(RawLocalFileSystem.java:261)
	at org.apache.hadoop.fs.ChecksumFileSystem.rename(ChecksumFileSystem.java:406)
	at org.apache.hadoop.mapred.TaskLog.writeToIndexFile(TaskLog.java:345)
	at org.apache.hadoop.mapred.TaskLog.syncLogs(TaskLog.java:391)
	at org.apache.hadoop.mapred.Child.main(Child.java:235)
{quote}

Based on this, I found that writeToIndexFile is missing synchronization. Adding it seems to reduce the failures that Allen is seeing.
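
To illustrate the kind of race the trace points at, here is a minimal sketch, not the actual TaskLog code. It assumes the index is written to a shared log.tmp and then renamed into place. The class name IndexFileSketch, the LOCK object, and the target name log.index are hypothetical. Without the lock, one thread can rename log.tmp away just before another thread's rename, which then fails because the source file no longer exists.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; this is NOT org.apache.hadoop.mapred.TaskLog.
class IndexFileSketch {
    // Single lock guarding the shared temporary file (assumed design).
    private static final Object LOCK = new Object();

    // Writes contents to dir/log.tmp, then renames it to dir/log.index.
    // The write and the rename happen under one lock so that no other
    // thread can move log.tmp away between the two steps.
    static void writeToIndexFile(Path dir, String contents) throws IOException {
        synchronized (LOCK) {
            Path tmp = dir.resolve("log.tmp");
            Path index = dir.resolve("log.index");
            Files.write(tmp, contents.getBytes(StandardCharsets.UTF_8));
            Files.move(tmp, index, StandardCopyOption.REPLACE_EXISTING);
        }
    }

    public static void main(String[] args) throws Exception {
        Path dir = Files.createTempDirectory("tasklog-sketch");
        List<Thread> threads = new ArrayList<>();
        // Several writers hammer the same index file concurrently.
        for (int i = 0; i < 4; i++) {
            final int id = i;
            Thread t = new Thread(() -> {
                try {
                    for (int n = 0; n < 100; n++) {
                        writeToIndexFile(dir, "writer-" + id + " pass-" + n + "\n");
                    }
                } catch (IOException e) {
                    throw new RuntimeException(e);
                }
            });
            threads.add(t);
            t.start();
        }
        for (Thread t : threads) {
            t.join();
        }
        System.out.println("index exists: " + Files.exists(dir.resolve("log.index")));
    }
}
```

Removing the synchronized block from this sketch makes the rename step prone to intermittent failures under concurrent callers, which would be consistent with only a fraction of tasks failing.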

> a small % of all tasks fail with DefaultTaskController
> ------------------------------------------------------
>
>                 Key: MAPREDUCE-2846
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2846
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: task, task-controller, tasktracker
>    Affects Versions: 0.20.204.0
>            Reporter: Allen Wittenauer
>            Priority: Blocker
>
> After upgrading our test 0.20.203 grid to 0.20.204-rc2, we ran terasort to verify operation.  While the job completed successfully, approx 10% of the tasks failed with task runner execution errors and the inability to create symlinks for attempt logs.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira