Posted to common-dev@hadoop.apache.org by "Devaraj Das (JIRA)" <ji...@apache.org> on 2007/10/09 19:31:51 UTC

[jira] Created: (HADOOP-2016) Race condition in removing a KILLED task from tasktracker

Race condition in removing a KILLED task from tasktracker
---------------------------------------------------------

                 Key: HADOOP-2016
                 URL: https://issues.apache.org/jira/browse/HADOOP-2016
             Project: Hadoop
          Issue Type: Bug
          Components: mapred
            Reporter: Devaraj Das
            Priority: Critical
             Fix For: 0.15.0


I ran into a situation where a speculative task was killed by the JobTracker and the relevant TaskTracker got the right KillTaskAction, but the tasktracker continued to hold a reference to that task (although the task JVM was killed). The task remained in the RUNNING state in both the JobTracker and that TaskTracker forever. I suspect there is a race condition in reading/updating data structures inside the taskCleanupThread and transmitHeartBeat.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-2016) Race condition in removing a KILLED task from tasktracker

Posted by "Devaraj Das (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-2016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12533655 ] 

Devaraj Das commented on HADOOP-2016:
-------------------------------------

When this happens, the temp files for the task are never cleared.



[jira] Updated: (HADOOP-2016) Race condition in removing a KILLED task from tasktracker

Posted by "Arun C Murthy (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-2016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Arun C Murthy updated HADOOP-2016:
----------------------------------

    Attachment: HADOOP-2016_20071011.patch

First cut - relatively straightforward; I'll continue testing.



[jira] Updated: (HADOOP-2016) Race condition in removing a KILLED task from tasktracker

Posted by "Arun C Murthy (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-2016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Arun C Murthy updated HADOOP-2016:
----------------------------------

    Attachment: HADOOP-2016_2_20071011.patch

I figured that a less intrusive approach, simply ignoring the child's status update rather than asking it to kill itself right away, would work as well. It is the more conservative option.



[jira] Updated: (HADOOP-2016) Race condition in removing a KILLED task from tasktracker

Posted by "Arun C Murthy (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-2016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Arun C Murthy updated HADOOP-2016:
----------------------------------

    Resolution: Fixed
        Status: Resolved  (was: Patch Available)

I just committed this.



[jira] Commented: (HADOOP-2016) Race condition in removing a KILLED task from tasktracker

Posted by "Arun C Murthy (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-2016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12533863 ] 

Arun C Murthy commented on HADOOP-2016:
---------------------------------------

Here are relevant logs:

{noformat}
1. 2007-10-09 12:19:15,055 INFO org.apache.hadoop.mapred.TaskTracker: Received KillTaskAction for task: task_200710090910_0003_r_001792_1
2. 2007-10-09 12:19:15,055 INFO org.apache.hadoop.mapred.TaskTracker: About to purge task: task_200710090910_0003_r_001792_1
3. 2007-10-09 12:19:15,056 INFO org.apache.hadoop.mapred.TaskTracker: task_200710090910_0003_r_001792_1 0.67524564% reduce > reduce
4. 2007-10-09 12:19:15,056 INFO org.apache.hadoop.mapred.TaskRunner: task_200710090910_0003_r_001792_1 done; removing files.
5. 2007-10-09 12:19:15,491 WARN org.apache.hadoop.mapred.TaskTracker: Unknown child task finshed: task_200710090910_0003_r_001792_1. Ignored.
6. 2007-10-09 12:19:18,059 WARN org.apache.hadoop.mapred.TaskTracker: Progress from unknown child task: task_200710090910_0003_r_001792_1
{noformat}

With particular emphasis on line #3 above, it looks like this happens because a task's progress update (from the child VM) was interleaved with the methods called while purging the task, i.e. {{TaskTracker#purgeTask}} -> {{TaskTracker#TaskInProgress#jobHasFinished}}, which then calls {{TaskTracker#TaskInProgress#kill}} and {{TaskTracker#TaskInProgress#cleanup}}.

Unfortunately, a few issues combine to produce this scenario:
a) {{TaskTracker#TaskInProgress#jobHasFinished}} isn't a synchronized method, so there are no transactional semantics for the calls it makes, i.e. {{TaskTracker#TaskInProgress#kill}} and {{TaskTracker#TaskInProgress#cleanup}}.
b) Thus the calls to kill and cleanup can be interleaved with a call to {{TaskTracker#TaskInProgress#reportProgress}} (as seen in the logs). This is dangerous since it is the *{{TaskTracker#TaskInProgress#cleanup}}* call which removes the taskid from {{TaskTracker#tasks}}.
c) {{TaskTracker#TaskInProgress#reportProgress}} unconditionally marks the task's run-state as {{RUNNING}}, which is clearly wrong since it overwrites the task's {{KILLED}} status set in {{TaskTracker#TaskInProgress#kill}}.

Overall, a combination of the above leads to the task never being removed from {{TaskTracker#runningTasks}}, which is the bug in question.
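
The interleaving described above can be replayed deterministically in a small Java sketch. Everything in it is a hypothetical stand-in and not the actual TaskTracker code: {{RaceSketch}}, {{Tip}} and {{heartbeat}} are names invented for illustration. It also assumes that the heartbeat path drops a task from {{runningTasks}} only once its state is no longer {{RUNNING}}; that pruning rule is an assumption of the sketch, not something stated in this thread.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified model; these are NOT the real Hadoop classes.
class RaceSketch {
    enum State { RUNNING, KILLED }

    // Stand-in for TaskTracker.TaskInProgress.
    static class Tip {
        final String id;
        State state = State.RUNNING;

        Tip(String id) { this.id = id; }

        synchronized void kill() { state = State.KILLED; }

        // Issue (c): the run-state is set to RUNNING unconditionally,
        // so a late report overwrites KILLED.
        synchronized void reportProgress(float progress) { state = State.RUNNING; }
    }

    // Stand-in for TaskTracker#runningTasks.
    final Map<String, Tip> runningTasks = new HashMap<>();

    // ASSUMPTION: the heartbeat drops only tasks that are no longer RUNNING.
    void heartbeat() {
        runningTasks.values().removeIf(t -> t.state != State.RUNNING);
    }

    // Replays the order seen in the logs: kill, late progress report, heartbeat.
    static boolean leaksAfterLateProgress() {
        RaceSketch tracker = new RaceSketch();
        Tip tip = new Tip("task_200710090910_0003_r_001792_1");
        tracker.runningTasks.put(tip.id, tip);

        tip.kill();                  // purge path marks the task KILLED
        tip.reportProgress(0.675f);  // the child's update arrives afterwards
        tracker.heartbeat();         // task looks RUNNING again, so it is kept

        return tracker.runningTasks.containsKey(tip.id);
    }

    public static void main(String[] args) {
        System.out.println("task leaked: " + leaksAfterLateProgress());
    }
}
```

Each method in the sketch is individually synchronized, yet the task still leaks, which is the point of (a): without one lock held across the whole purge sequence, a progress report can slip in between the individual calls.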

There are two ways around this:
a) Call {{tasks.remove(taskid)}} from {{TaskTracker#TaskInProgress#kill}} so that the interleaved call to {{TaskTracker#TaskInProgress#reportProgress}} cannot wrongly set the task status back to {{RUNNING}}
or
b) Check that the task's state is actually {{RUNNING}} before updating its status when the child reports in.

I'd go with (b).
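
A minimal sketch of what (b) could look like, again using hypothetical stand-ins ({{GuardedProgressSketch}}, {{Tip}}, {{heartbeat}}) and not the attached patch itself. As before, the heartbeat pruning rule is an assumption made for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified model of option (b); NOT the committed patch.
class GuardedProgressSketch {
    enum State { RUNNING, KILLED }

    // Stand-in for TaskTracker.TaskInProgress.
    static class Tip {
        final String id;
        private State state = State.RUNNING;
        private float progress;

        Tip(String id) { this.id = id; }

        synchronized void kill() { state = State.KILLED; }

        synchronized State getState() { return state; }

        synchronized float getProgress() { return progress; }

        // Option (b): accept the child's update only while the task is RUNNING.
        // Returns false when the update was ignored.
        synchronized boolean reportProgress(float newProgress) {
            if (state != State.RUNNING) {
                return false;
            }
            progress = newProgress;
            return true;
        }
    }

    // Stand-in for TaskTracker#runningTasks.
    final Map<String, Tip> runningTasks = new HashMap<>();

    // ASSUMPTION: the heartbeat drops only tasks that are no longer RUNNING.
    void heartbeat() {
        runningTasks.values().removeIf(t -> t.getState() != State.RUNNING);
    }

    // Same order as in the logs: kill, late progress report, heartbeat.
    static boolean removedDespiteLateProgress() {
        GuardedProgressSketch tracker = new GuardedProgressSketch();
        Tip tip = new Tip("task_200710090910_0003_r_001792_1");
        tracker.runningTasks.put(tip.id, tip);

        tip.kill();
        boolean accepted = tip.reportProgress(0.675f); // ignored: task is KILLED
        tracker.heartbeat();                           // KILLED task is pruned

        return !accepted && !tracker.runningTasks.containsKey(tip.id);
    }

    public static void main(String[] args) {
        System.out.println("task removed: " + removedDespiteLateProgress());
    }
}
```

The guard leaves the kill path untouched and only makes the late status update a no-op, which is why it is the less intrusive of the two options.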




[jira] Updated: (HADOOP-2016) Race condition in removing a KILLED task from tasktracker

Posted by "Devaraj Das (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-2016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Devaraj Das updated HADOOP-2016:
--------------------------------

    Priority: Blocker  (was: Critical)



[jira] Commented: (HADOOP-2016) Race condition in removing a KILLED task from tasktracker

Posted by "Hudson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-2016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12534525 ] 

Hudson commented on HADOOP-2016:
--------------------------------

Integrated in Hadoop-Nightly #270 (See [http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Nightly/270/])



[jira] Commented: (HADOOP-2016) Race condition in removing a KILLED task from tasktracker

Posted by "Owen O'Malley (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-2016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12534111 ] 

Owen O'Malley commented on HADOOP-2016:
---------------------------------------

+1, although moving the logging line up seems more confusing than helpful.



[jira] Updated: (HADOOP-2016) Race condition in removing a KILLED task from tasktracker

Posted by "Arun C Murthy (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-2016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Arun C Murthy updated HADOOP-2016:
----------------------------------

    Affects Version/s: 0.14.2
               Status: Patch Available  (was: Open)

Submitting patch. I've run this multiple times and checked that no stray _${taskid} directories are left behind in ${mapred.output.dir}.
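
For anyone who wants to repeat that check, here is a rough sketch of one way to script it. It makes two assumptions: that the output directory is reachable as a local path (on a real cluster it would normally live on DFS and need the Hadoop FileSystem API instead), and that stray directories are named {{_task_...}}, following the task ids in the logs. {{StrayTaskDirCheck}} and {{findStrayTaskDirs}} are made-up names.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; assumes the output directory is on the local filesystem.
class StrayTaskDirCheck {

    // Returns the sub-directories of outputDir whose names start with "_task_".
    static List<Path> findStrayTaskDirs(Path outputDir) throws IOException {
        List<Path> stray = new ArrayList<>();
        // The glob is matched against each entry's file name.
        try (DirectoryStream<Path> entries =
                 Files.newDirectoryStream(outputDir, "_task_*")) {
            for (Path entry : entries) {
                if (Files.isDirectory(entry)) {
                    stray.add(entry);
                }
            }
        }
        return stray;
    }

    public static void main(String[] args) throws IOException {
        // Pass the job's output directory as the first argument.
        Path outputDir = Paths.get(args.length > 0 ? args[0] : ".");
        for (Path dir : findStrayTaskDirs(outputDir)) {
            System.out.println("stray: " + dir);
        }
    }
}
```

An empty listing after a job with killed speculative tasks is consistent with the fix working, although it does not prove the race is gone.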



[jira] Assigned: (HADOOP-2016) Race condition in removing a KILLED task from tasktracker

Posted by "Arun C Murthy (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-2016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Arun C Murthy reassigned HADOOP-2016:
-------------------------------------

    Assignee: Arun C Murthy
