Posted to common-dev@hadoop.apache.org by "Devaraj Das (JIRA)" <ji...@apache.org> on 2007/10/09 19:31:51 UTC
[jira] Created: (HADOOP-2016) Race condition in removing a KILLED
task from tasktracker
Race condition in removing a KILLED task from tasktracker
---------------------------------------------------------
Key: HADOOP-2016
URL: https://issues.apache.org/jira/browse/HADOOP-2016
Project: Hadoop
Issue Type: Bug
Components: mapred
Reporter: Devaraj Das
Priority: Critical
Fix For: 0.15.0
I ran into a situation where a speculative task was killed by the JobTracker and the relevant TaskTracker got the right KillTaskAction, but the TaskTracker continued to hold a reference to that task (although the task JVM was killed). The task remained in the RUNNING state in both the JobTracker and that TaskTracker forever. I suspect there is a race condition in reading/updating data structures inside the taskCleanupThread and transmitHeartBeat.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Commented: (HADOOP-2016) Race condition in removing a KILLED
task from tasktracker
Posted by "Devaraj Das (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HADOOP-2016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12533655 ]
Devaraj Das commented on HADOOP-2016:
-------------------------------------
When this happens, the temp files for the task are never cleared.
[jira] Updated: (HADOOP-2016) Race condition in removing a KILLED
task from tasktracker
Posted by "Arun C Murthy (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HADOOP-2016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Arun C Murthy updated HADOOP-2016:
----------------------------------
Attachment: HADOOP-2016_20071011.patch
First cut, relatively straightforward; I'll continue testing.
[jira] Updated: (HADOOP-2016) Race condition in removing a KILLED
task from tasktracker
Posted by "Arun C Murthy (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HADOOP-2016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Arun C Murthy updated HADOOP-2016:
----------------------------------
Attachment: HADOOP-2016_2_20071011.patch
I figured that a less intrusive approach, which just ignores the child's status update and does not ask it to kill itself right away, would work as well. It is a more conservative option.
[jira] Updated: (HADOOP-2016) Race condition in removing a KILLED
task from tasktracker
Posted by "Arun C Murthy (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HADOOP-2016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Arun C Murthy updated HADOOP-2016:
----------------------------------
Resolution: Fixed
Status: Resolved (was: Patch Available)
I just committed this.
[jira] Commented: (HADOOP-2016) Race condition in removing a KILLED
task from tasktracker
Posted by "Arun C Murthy (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HADOOP-2016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12533863 ]
Arun C Murthy commented on HADOOP-2016:
---------------------------------------
Here are relevant logs:
{noformat}
1. 2007-10-09 12:19:15,055 INFO org.apache.hadoop.mapred.TaskTracker: Received KillTaskAction for task: task_200710090910_0003_r_001792_1
2. 2007-10-09 12:19:15,055 INFO org.apache.hadoop.mapred.TaskTracker: About to purge task: task_200710090910_0003_r_001792_1
3. 2007-10-09 12:19:15,056 INFO org.apache.hadoop.mapred.TaskTracker: task_200710090910_0003_r_001792_1 0.67524564% reduce > reduce
4. 2007-10-09 12:19:15,056 INFO org.apache.hadoop.mapred.TaskRunner: task_200710090910_0003_r_001792_1 done; removing files.
5. 2007-10-09 12:19:15,491 WARN org.apache.hadoop.mapred.TaskTracker: Unknown child task finshed: task_200710090910_0003_r_001792_1. Ignored.
6. 2007-10-09 12:19:18,059 WARN org.apache.hadoop.mapred.TaskTracker: Progress from unknown child task: task_200710090910_0003_r_001792_1
{noformat}
With particular emphasis on line #3 above, it looks like this happens because a progress update from the task (the child VM) was interleaved with the methods called while purging the task, i.e.
{{TaskTracker#purgeTask}} -> {{TaskTracker#TaskInProgress#jobHasFinished}}, which then calls {{TaskTracker#TaskInProgress#kill}} and {{TaskTracker#TaskInProgress#cleanup}}.
Unfortunately, a couple of issues combine to produce this scenario:
a) {{TaskTracker#TaskInProgress#jobHasFinished}} isn't a synchronized method, so the calls it makes, i.e. {{TaskTracker#TaskInProgress#kill}} and {{TaskTracker#TaskInProgress#cleanup}}, do not execute as one atomic unit.
b) Thus the calls to kill and cleanup can be interleaved with a call to {{TaskTracker#TaskInProgress#reportProgress}} (as seen in the logs). This is dangerous since it is the *{{TaskTracker#TaskInProgress#cleanup}}* call which removes the taskid from {{TaskTracker#tasks}}.
c) {{TaskTracker#TaskInProgress#reportProgress}} unconditionally marks the task's run-state as {{RUNNING}}, which is clearly wrong since it overwrites the task's {{KILLED}} status set in {{TaskTracker#TaskInProgress#kill}}.
Overall, a combination of the above leads to the task never being removed from {{TaskTracker#runningTasks}}, which leads to the bug in question.
The way to get around this is to:
a) Call {{tasks.remove(taskid)}} from {{TaskTracker#TaskInProgress#kill}} so that the interleaved call to {{TaskTracker#TaskInProgress#reportProgress}} cannot wrongly update the task status to {{RUNNING}}
or
b) Check that the task's state is actually {{RUNNING}} before updating its status when the child reports in.
I'd go with (b).
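For readers following the analysis above, here is a minimal sketch of the interleaving and of option (b). It is a simplified, hypothetical model and not the actual TaskTracker code or the committed patch: the class KilledTaskRaceSketch, the two reportProgress variants, the getters and the task ids are all invented for illustration. In particular, the condition inside cleanup() is an assumption made only so the symptom can be reproduced; the real method may behave differently.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Simplified, hypothetical model of the race described above.
// None of this is the real TaskTracker code; names only mirror the discussion.
public class KilledTaskRaceSketch {

    enum State { RUNNING, KILLED }

    static class TaskInProgress {
        private final String taskId;
        private final Map<String, TaskInProgress> tasks;
        private State runState = State.RUNNING;
        private float progress;

        TaskInProgress(String taskId, Map<String, TaskInProgress> tasks) {
            this.taskId = taskId;
            this.tasks = tasks;
            tasks.put(taskId, this);
        }

        // Issue (c): the update overwrites whatever state the task was in.
        synchronized void reportProgressUnchecked(float p) {
            progress = p;
            runState = State.RUNNING;
        }

        // Option (b): accept the update only while the task is still RUNNING.
        synchronized boolean reportProgress(float p) {
            if (runState != State.RUNNING) {
                return false; // late update from a killed child, ignored
            }
            progress = p;
            return true;
        }

        synchronized void kill() {
            runState = State.KILLED;
        }

        // Assumption for this sketch: cleanup only forgets tasks
        // that are no longer RUNNING.
        synchronized void cleanup() {
            if (runState != State.RUNNING) {
                tasks.remove(taskId);
            }
        }

        synchronized State getRunState() { return runState; }

        synchronized float getProgress() { return progress; }
    }

    public static void main(String[] args) {
        Map<String, TaskInProgress> tasks = new ConcurrentHashMap<>();

        // Interleaving from the logs: kill, a late progress update, then cleanup.
        TaskInProgress unchecked = new TaskInProgress("attempt_unchecked", tasks);
        unchecked.kill();
        unchecked.reportProgressUnchecked(0.675f);
        unchecked.cleanup();
        System.out.println("unchecked: still tracked = "
                + tasks.containsKey("attempt_unchecked"));

        // Same interleaving with the state check from option (b).
        TaskInProgress checked = new TaskInProgress("attempt_checked", tasks);
        checked.kill();
        boolean accepted = checked.reportProgress(0.675f);
        checked.cleanup();
        System.out.println("checked: update accepted = " + accepted
                + ", still tracked = " + tasks.containsKey("attempt_checked"));
    }
}
```

The interleaving is replayed sequentially here so the outcome is deterministic. In this model the unchecked variant leaves the killed task tracked as RUNNING, while the checked variant drops the late update and lets cleanup remove the task.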
[jira] Updated: (HADOOP-2016) Race condition in removing a KILLED
task from tasktracker
Posted by "Devaraj Das (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HADOOP-2016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Devaraj Das updated HADOOP-2016:
--------------------------------
Priority: Blocker (was: Critical)
[jira] Commented: (HADOOP-2016) Race condition in removing a KILLED
task from tasktracker
Posted by "Hudson (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HADOOP-2016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12534525 ]
Hudson commented on HADOOP-2016:
--------------------------------
Integrated in Hadoop-Nightly #270 (See [http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Nightly/270/])
[jira] Commented: (HADOOP-2016) Race condition in removing a KILLED
task from tasktracker
Posted by "Owen O'Malley (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HADOOP-2016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12534111 ]
Owen O'Malley commented on HADOOP-2016:
---------------------------------------
+1, although moving the logging line up seems more confusing than helpful.
[jira] Updated: (HADOOP-2016) Race condition in removing a KILLED
task from tasktracker
Posted by "Arun C Murthy (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HADOOP-2016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Arun C Murthy updated HADOOP-2016:
----------------------------------
Affects Version/s: 0.14.2
Status: Patch Available (was: Open)
Submitting the patch. I've run this multiple times and checked that there are no tell-tale signs of stray _${taskid} directories lying in ${mapred.output.dir}.
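As a rough illustration of that kind of check, the sketch below lists leftover directories whose names look like _${taskid}. It rests on several assumptions: the output directory is treated as a local path (a real job's output would normally be inspected through Hadoop's own FileSystem API), task ids are taken to start with "task_" as in the logs quoted in this thread, and the class and method names are hypothetical.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Hadoop: scans a local directory for
// leftover per-task temporary directories named _task_*.
public class StrayTaskDirCheck {

    static List<Path> findStrayTaskDirs(Path outputDir) throws IOException {
        List<Path> stray = new ArrayList<>();
        // The glob is matched against each entry's file name.
        try (DirectoryStream<Path> entries =
                 Files.newDirectoryStream(outputDir, "_task_*")) {
            for (Path entry : entries) {
                if (Files.isDirectory(entry)) {
                    stray.add(entry);
                }
            }
        }
        return stray;
    }

    public static void main(String[] args) throws IOException {
        // Assumption: the directory to inspect is passed as the first argument.
        Path outputDir = Paths.get(args.length > 0 ? args[0] : ".");
        List<Path> stray = findStrayTaskDirs(outputDir);
        if (stray.isEmpty()) {
            System.out.println("No stray task directories found.");
        } else {
            for (Path p : stray) {
                System.out.println("Stray task directory: " + p);
            }
        }
    }
}
```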
[jira] Assigned: (HADOOP-2016) Race condition in removing a KILLED
task from tasktracker
Posted by "Arun C Murthy (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HADOOP-2016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Arun C Murthy reassigned HADOOP-2016:
-------------------------------------
Assignee: Arun C Murthy