Posted to common-dev@hadoop.apache.org by "Devaraj Das (JIRA)" <ji...@apache.org> on 2007/07/10 13:55:06 UTC

[jira] Created: (HADOOP-1586) Progress reporting thread can afford to be slightly lenient towards exceptions other than ConnectException

Progress reporting thread can afford to be slightly lenient towards exceptions other than ConnectException
----------------------------------------------------------------------------------------------------------

                 Key: HADOOP-1586
                 URL: https://issues.apache.org/jira/browse/HADOOP-1586
             Project: Hadoop
          Issue Type: Bug
          Components: mapred
    Affects Versions: 0.14.0
            Reporter: Devaraj Das
            Assignee: Devaraj Das
             Fix For: 0.14.0


Currently, in the loop of Task.startCommunicationThread, MAX_RETRIES (set to three) attempts are made to report progress/ping (TaskUmbilicalProtocol.progress or TaskUmbilicalProtocol.ping). All attempt failures are counted as critical. Here I am proposing a variant: treat only ConnectException as critical and treat all other exceptions as non-critical. The other exception could be SocketTimeoutException in the case of the two RPCs.
The reason I am proposing this is that since HADOOP-1462 went in, I have been seeing quite a few unexpected 65 deaths, and with some logging it appears that most of them are due to a SocketTimeoutException in the progress RPC call (before HADOOP-1462, the return value of progress was not checked). When the hack described above was put in, things improved considerably.
One argument against the above proposal is that the tasktracker could be faulty when a task can connect to it but cannot successfully invoke an RPC on it. If this is indeed the case, then even in the current scheme of things the only resort is to restart the tasktracker (either manually, or by the JobTracker asking it to reinitialize). In both cases, the normal behavior of the protocol ensures that the child task dies, since the reinitialized tasktracker will return false for the progress/ping calls.
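
For illustration, the proposed policy could be sketched roughly as below. This is not the attached patch and not the actual Task.startCommunicationThread code: UmbilicalCall, Outcome and run are hypothetical names standing in for one progress/ping attempt and the loop around it, and only MAX_RETRIES = 3 comes from the description. The sketch also assumes that a successful call resets the retry budget.

```java
import java.io.IOException;
import java.net.ConnectException;
import java.net.SocketTimeoutException;
import java.util.Arrays;
import java.util.List;

class ProgressRetrySketch {
    static final int MAX_RETRIES = 3;

    /** Hypothetical stand-in for one progress/ping RPC to the tasktracker. */
    interface UmbilicalCall {
        boolean call() throws IOException;
    }

    /** Hypothetical result of driving the reporting loop over some attempts. */
    enum Outcome { STILL_RUNNING, TRACKER_SAID_STOP, GAVE_UP }

    /**
     * Drives one attempt per reporting interval. Only ConnectException
     * consumes the retry budget; any other IOException (for example
     * SocketTimeoutException) is treated as non-critical and skipped.
     */
    static Outcome run(List<UmbilicalCall> attempts) {
        int remainingRetries = MAX_RETRIES;
        for (UmbilicalCall attempt : attempts) {
            try {
                if (!attempt.call()) {
                    // The tasktracker answered and told the task to stop.
                    return Outcome.TRACKER_SAID_STOP;
                }
                // Assumption: a successful report resets the budget.
                remainingRetries = MAX_RETRIES;
            } catch (ConnectException e) {
                // Critical: the tasktracker could not be reached at all.
                remainingRetries--;
                if (remainingRetries == 0) {
                    return Outcome.GAVE_UP;
                }
            } catch (IOException e) {
                // Non-critical: try again at the next interval.
            }
        }
        return Outcome.STILL_RUNNING;
    }

    public static void main(String[] args) {
        UmbilicalCall ok = () -> true;
        UmbilicalCall timeout = () -> { throw new SocketTimeoutException("timed out"); };
        UmbilicalCall refused = () -> { throw new ConnectException("refused"); };

        // Three timeouts in a row do not kill the task under this policy.
        System.out.println(run(Arrays.asList(timeout, timeout, timeout, ok)));
        // Three connection failures in a row do.
        System.out.println(run(Arrays.asList(refused, refused, refused)));
    }
}
```

Under the current behavior, the second catch block would decrement the budget as well, so three consecutive timeouts would end the task just as three connection failures do.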

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (HADOOP-1586) Progress reporting thread can afford to be slightly lenient towards exceptions other than ConnectException

Posted by "Devaraj Das (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-1586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Devaraj Das updated HADOOP-1586:
--------------------------------

    Attachment:     (was: 1586.patch)


[jira] Commented: (HADOOP-1586) Progress reporting thread can afford to be slightly lenient towards exceptions other than ConnectException

Posted by "Devaraj Das (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-1586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12511490 ] 

Devaraj Das commented on HADOOP-1586:
-------------------------------------

I discovered an issue with the patch. I am removing it and will submit another soon.


[jira] Commented: (HADOOP-1586) Progress reporting thread can afford to be slightly lenient towards exceptions other than ConnectException

Posted by "Vivek Ratan (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-1586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12511575 ] 

Vivek Ratan commented on HADOOP-1586:
-------------------------------------

Do we know why SocketTimeoutException is being thrown? Is the TT too busy to respond to the call? How about increasing the socket timeout? I'm not sure you want to treat SocketTimeoutException and ConnectException differently. What if the TT is hung, so that the former is thrown but not the latter? It might make sense for the Task to realize that and kill itself after 3 tries.


[jira] Resolved: (HADOOP-1586) Progress reporting thread can afford to be slightly lenient towards exceptions other than ConnectException

Posted by "Devaraj Das (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-1586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Devaraj Das resolved HADOOP-1586.
---------------------------------

    Resolution: Won't Fix

This issue is handled better in the related issue, HADOOP-1651.


[jira] Updated: (HADOOP-1586) Progress reporting thread can afford to be slightly lenient towards exceptions other than ConnectException

Posted by "Devaraj Das (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-1586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Devaraj Das updated HADOOP-1586:
--------------------------------

    Attachment: 1586.patch

This patch addresses the issue described, and also doubles the number of handlers for the TaskUmbilicalProtocol server to 2*maxCurrentTasks.
