Posted to common-dev@hadoop.apache.org by "Devaraj Das (JIRA)" <ji...@apache.org> on 2007/05/03 19:24:15 UTC

[jira] Created: (HADOOP-1324) FSError encountered by one running task should not be fatal to other tasks on that node

FSError encountered by one running task should not be fatal to other tasks on that node
---------------------------------------------------------------------------------------

                 Key: HADOOP-1324
                 URL: https://issues.apache.org/jira/browse/HADOOP-1324
             Project: Hadoop
          Issue Type: Improvement
          Components: mapred
            Reporter: Devaraj Das


Currently, if one task encounters an FSError, it reports the error to the TaskTracker, and the TaskTracker reinitializes itself, effectively losing the state of all the other running tasks as well. This can probably be improved, especially after the fix for HADOOP-1252. Rather than reinitializing itself, the TaskTracker should probably just get blacklisted for that job. The other tasks should be allowed to continue as long as they can, either completing successfully or failing on their own (due to disk problems or otherwise).
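The proposed behavior can be sketched in miniature as follows. Every class name, method name, and the failure threshold below are illustrative assumptions for this sketch, not the actual TaskTracker API:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Miniature model of the proposal: an FSError reported by one task kills
// only that task, and once enough tasks of a job have failed on this
// tracker, the tracker is blacklisted for that job.
class TrackerModel {
    static final int MAX_FAILURES_PER_JOB = 4; // assumed blacklist threshold

    private final Set<String> runningTasks = new HashSet<String>();
    private final Map<String, Integer> failuresPerJob = new HashMap<String, Integer>();
    private final Set<String> blacklistedJobs = new HashSet<String>();

    void launchTask(String taskId) {
        runningTasks.add(taskId);
    }

    // Old behavior: the tracker reinitializes, dropping every running task.
    void onFSErrorOld(String taskId) {
        runningTasks.clear();
    }

    // Proposed behavior: kill only the failing task, and count the failure
    // toward blacklisting this tracker for the task's job.
    void onFSErrorNew(String taskId, String jobId) {
        runningTasks.remove(taskId);
        Integer prev = failuresPerJob.get(jobId);
        int count = (prev == null ? 0 : prev.intValue()) + 1;
        failuresPerJob.put(jobId, Integer.valueOf(count));
        if (count >= MAX_FAILURES_PER_JOB) {
            blacklistedJobs.add(jobId);
        }
    }

    int runningCount() {
        return runningTasks.size();
    }

    boolean isBlacklistedFor(String jobId) {
        return blacklistedJobs.contains(jobId);
    }
}
```

Under the old behavior, one FSError wipes out every task on the node; under the proposed one, unaffected tasks keep running and the tracker simply stops being assigned new tasks for the job once it is blacklisted.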

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-1324) FSError encountered by one running task should not be fatal to other tasks on that node

Posted by "Arun C Murthy (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-1324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Arun C Murthy updated HADOOP-1324:
----------------------------------

    Affects Version/s: 0.12.3
               Status: Patch Available  (was: Open)

> FSError encountered by one running task should not be fatal to other tasks on that node
> ---------------------------------------------------------------------------------------
>
>                 Key: HADOOP-1324
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1324
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: mapred
>    Affects Versions: 0.12.3
>            Reporter: Devaraj Das
>         Assigned To: Arun C Murthy
>         Attachments: HADOOP-1324_20070507_1.patch



[jira] Commented: (HADOOP-1324) FSError encountered by one running task should not be fatal to other tasks on that node

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-1324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12494259 ] 

Hadoop QA commented on HADOOP-1324:
-----------------------------------

Integrated in Hadoop-Nightly #82 (See http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Nightly/82/)

> FSError encountered by one running task should not be fatal to other tasks on that node
> ---------------------------------------------------------------------------------------
>
>                 Key: HADOOP-1324
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1324
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: mapred
>    Affects Versions: 0.12.3
>            Reporter: Devaraj Das
>         Assigned To: Arun C Murthy
>             Fix For: 0.13.0
>
>         Attachments: HADOOP-1324_20070507_1.patch



[jira] Commented: (HADOOP-1324) FSError encountered by one running task should not be fatal to other tasks on that node

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-1324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12494004 ] 

Hadoop QA commented on HADOOP-1324:
-----------------------------------

+1

http://issues.apache.org/jira/secure/attachment/12356848/HADOOP-1324_20070507_1.patch applied and successfully tested against trunk revision r534975.

Test results:   http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/120/testReport/
Console output: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/120/console

> FSError encountered by one running task should not be fatal to other tasks on that node
> ---------------------------------------------------------------------------------------
>
>                 Key: HADOOP-1324
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1324
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: mapred
>    Affects Versions: 0.12.3
>            Reporter: Devaraj Das
>         Assigned To: Arun C Murthy
>         Attachments: HADOOP-1324_20070507_1.patch



[jira] Updated: (HADOOP-1324) FSError encountered by one running task should not be fatal to other tasks on that node

Posted by "Arun C Murthy (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-1324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Arun C Murthy updated HADOOP-1324:
----------------------------------

    Attachment: HADOOP-1324_20070507_1.patch

Simple fix:
On receipt of an FSError from the child VM, the TaskTracker now just kills the offending task instead of reinitializing itself. The idea is that with a sufficient number of task failures on the same tracker, the tracker gets blacklisted for that job, no new tasks of that job are assigned to it, and things should swim along...
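The shape of that change can be sketched as below. The method and field names (fsError, purgeTask) are hypothetical; the actual change is in HADOOP-1324_20070507_1.patch:

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of the fix's shape: the tracker-side handler invoked
// when a child VM reports an FSError now purges just the offending task
// instead of reinitializing the whole tracker.
class TaskTrackerSketch {
    private final Set<String> runningTasks = new HashSet<String>();

    void registerTask(String taskId) {
        runningTasks.add(taskId);
    }

    // Called when the child VM running taskId reports an FSError.
    synchronized void fsError(String taskId, String message) {
        System.err.println("FSError, killing task " + taskId + ": " + message);
        // Before the fix, the tracker would reinitialize here, discarding
        // every entry in runningTasks. Now it kills only the reporter:
        purgeTask(taskId);
    }

    private void purgeTask(String taskId) {
        runningTasks.remove(taskId);
    }

    int runningCount() {
        return runningTasks.size();
    }
}
```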

> FSError encountered by one running task should not be fatal to other tasks on that node
> ---------------------------------------------------------------------------------------
>
>                 Key: HADOOP-1324
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1324
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: mapred
>            Reporter: Devaraj Das
>         Assigned To: Arun C Murthy
>         Attachments: HADOOP-1324_20070507_1.patch



[jira] Updated: (HADOOP-1324) FSError encountered by one running task should not be fatal to other tasks on that node

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-1324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doug Cutting updated HADOOP-1324:
---------------------------------

       Resolution: Fixed
    Fix Version/s: 0.13.0
           Status: Resolved  (was: Patch Available)

I just committed this.  Thanks, Arun!

> FSError encountered by one running task should not be fatal to other tasks on that node
> ---------------------------------------------------------------------------------------
>
>                 Key: HADOOP-1324
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1324
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: mapred
>    Affects Versions: 0.12.3
>            Reporter: Devaraj Das
>         Assigned To: Arun C Murthy
>             Fix For: 0.13.0
>
>         Attachments: HADOOP-1324_20070507_1.patch



[jira] Assigned: (HADOOP-1324) FSError encountered by one running task should not be fatal to other tasks on that node

Posted by "Arun C Murthy (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-1324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Arun C Murthy reassigned HADOOP-1324:
-------------------------------------

    Assignee: Arun C Murthy

> FSError encountered by one running task should not be fatal to other tasks on that node
> ---------------------------------------------------------------------------------------
>
>                 Key: HADOOP-1324
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1324
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: mapred
>            Reporter: Devaraj Das
>         Assigned To: Arun C Murthy



[jira] Commented: (HADOOP-1324) FSError encountered by one running task should not be fatal to other tasks on that node

Posted by "Devaraj Das (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-1324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12494068 ] 

Devaraj Das commented on HADOOP-1324:
-------------------------------------

+1

> FSError encountered by one running task should not be fatal to other tasks on that node
> ---------------------------------------------------------------------------------------
>
>                 Key: HADOOP-1324
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1324
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: mapred
>    Affects Versions: 0.12.3
>            Reporter: Devaraj Das
>         Assigned To: Arun C Murthy
>         Attachments: HADOOP-1324_20070507_1.patch
