Posted to common-dev@hadoop.apache.org by "Sanjay Dahiya (JIRA)" <ji...@apache.org> on 2006/11/20 13:47:02 UTC
[jira] Created: (HADOOP-737) TaskTracker's job cleanup loop should check for finished job before deleting local directories
TaskTracker's job cleanup loop should check for finished job before deleting local directories
-----------------------------------------------------------------------------------------------
Key: HADOOP-737
URL: http://issues.apache.org/jira/browse/HADOOP-737
Project: Hadoop
Issue Type: Bug
Components: mapred
Reporter: Sanjay Dahiya
Assigned To: Sanjay Dahiya
Priority: Critical
TaskTracker uses jobClient.pollForTaskWithClosedJob() to find tasks that should be closed. This mechanism doesn't pass along whether the job is really finished or the task is being killed for some other reason (e.g., a speculative instance succeeded). Since the TaskTracker doesn't know this state, it assumes the job is finished and deletes the local job directory, causing any subsequent tasks for the same job on the same TaskTracker to fail with a "job.xml not found" exception, as reported in HADOOP-546 and possibly in HADOOP-543. This causes my patch for HADOOP-76 to fail for a large number of reduce tasks in some cases.
The same issue produces extra exceptions in the logs while a job is being killed: the first task that gets closed deletes the local directories, and any other tasks that are about to be launched throw this exception. In that case it is less significant, as the job is being killed anyway and the only effect is extra exceptions in the logs.
Possible solutions:
1. Add an extra method to InterTrackerProtocol for checking the job status before deleting the local directory.
2. Set TaskTracker.RunningJob.localized to false once the local directory is deleted, so that new tasks don't look for it there.
There is still a race condition here, and the logs may get the exception during shutdown, but in normal cases it would work.
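Option 2 could be sketched roughly as follows. This is a minimal, self-contained simulation of the idea, not the actual TaskTracker code: the class and method names (RunningJob, localize(), purgeLocalDir(), launchTask()) are illustrative stand-ins, and the "re-localize on demand" behavior is one possible policy.

```java
import java.util.concurrent.atomic.AtomicBoolean;

class RunningJob {
    // Guards access to the job's local directory (job.xml etc.).
    final AtomicBoolean localized = new AtomicBoolean(false);

    void localize() {
        // ... would download job.xml into the local job dir ...
        localized.set(true);
    }

    void purgeLocalDir() {
        // Flip the flag *before* deleting, so a task launching
        // concurrently does not see a half-deleted directory.
        localized.set(false);
        // ... would delete the local job dir here ...
    }

    boolean launchTask() {
        if (!localized.get()) {
            // Local dir already gone (job finished or killed):
            // re-localize instead of failing with "job.xml not found".
            localize();
        }
        return localized.get();
    }
}

public class LocalizedFlagSketch {
    public static void main(String[] args) {
        RunningJob job = new RunningJob();
        job.localize();
        job.purgeLocalDir();                  // cleanup loop deletes local dirs
        System.out.println(job.launchTask()); // late task re-localizes: true
    }
}
```

As noted above, this does not eliminate the race entirely (a task can pass the check just before the purge), but it narrows the window in the common case.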
Comments ?
--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Resolved: (HADOOP-737) TaskTracker's job cleanup loop should check for finished job before deleting local directories
Posted by "Arun C Murthy (JIRA)" <ji...@apache.org>.
[ http://issues.apache.org/jira/browse/HADOOP-737?page=all ]
Arun C Murthy resolved HADOOP-737.
----------------------------------
Fix Version/s: 0.10.0
Resolution: Fixed
Fixed as a part of HADOOP-639.
[jira] Assigned: (HADOOP-737) TaskTracker's job cleanup loop should check for finished job before deleting local directories
Posted by "Arun C Murthy (JIRA)" <ji...@apache.org>.
[ http://issues.apache.org/jira/browse/HADOOP-737?page=all ]
Arun C Murthy reassigned HADOOP-737:
------------------------------------
Assignee: Arun C Murthy (was: Sanjay Dahiya)