Posted to common-dev@hadoop.apache.org by "Hairong Kuang (JIRA)" <ji...@apache.org> on 2008/11/25 20:09:44 UTC

[jira] Created: (HADOOP-4724) TaskTracker, DataNode, and SecondaryNameNode should timeout on waiting for its server to be up

TaskTracker, DataNode, and SecondaryNameNode should timeout on waiting for its server to be up
----------------------------------------------------------------------------------------------

                 Key: HADOOP-4724
                 URL: https://issues.apache.org/jira/browse/HADOOP-4724
             Project: Hadoop Core
          Issue Type: Bug
            Reporter: Hairong Kuang
             Fix For: 0.20.0


TaskTracker, DataNode, and SecondaryNameNode currently wait forever if their server is not up. They should take a configuration parameter that tells them when to give up, with a default value of many minutes, hours, or more to cope with basic choreography issues in a cluster. Test clusters can be configured to fail sooner rather than later.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-4724) TaskTracker, DataNode, and SecondaryNameNode should timeout on waiting for its server to be up

Posted by "Steve Loughran (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12651331#action_12651331 ] 

Steve Loughran commented on HADOOP-4724:
----------------------------------------

>Something like datanode.connect.timeout, tasktracker.connect.timeout, dfsclient.connect.timeout...

Maybe include the fact that this is for IPC timeouts, not, say, HTTP:

datanode.ipc.connect.timeout
tasktracker.ipc.connect.timeout
dfsclient.ipc.connect.timeout

>I am thinking of starting with a large number like 1 hour or 1 day. It is at least backwards compatible.

24 hours would be good. It lets you handle the kind of outage that has the team paged in from home, and it removes the "fix this in 15 minutes before the nodes start giving up" crisis.
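
To make the naming and default concrete, here is a minimal sketch of how such a setting might be looked up. It uses plain java.util.Properties rather than Hadoop's own Configuration class so that it stands alone. The property names are the ones proposed above; the class name, the method name, and the choice of milliseconds as the unit are assumptions made for illustration, not anything decided on this issue.

```java
import java.util.Properties;
import java.util.concurrent.TimeUnit;

public class IpcTimeoutConfig {
    // Property names as proposed in this thread.
    public static final String DATANODE_KEY = "datanode.ipc.connect.timeout";
    public static final String TASKTRACKER_KEY = "tasktracker.ipc.connect.timeout";
    public static final String DFSCLIENT_KEY = "dfsclient.ipc.connect.timeout";

    // Assumption: values are in milliseconds; the default is the 24 hours suggested above.
    public static final long DEFAULT_TIMEOUT_MS = TimeUnit.HOURS.toMillis(24);

    /** Returns the configured timeout for the key, or the 24-hour default if unset. */
    public static long connectTimeoutMs(Properties conf, String key) {
        String raw = conf.getProperty(key);
        if (raw == null || raw.trim().isEmpty()) {
            return DEFAULT_TIMEOUT_MS;
        }
        long value = Long.parseLong(raw.trim());
        if (value <= 0) {
            throw new IllegalArgumentException(key + " must be positive, got " + value);
        }
        return value;
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        // A test cluster could override the default to fail quickly.
        conf.setProperty(TASKTRACKER_KEY, "30000");
        System.out.println(DATANODE_KEY + " = " + connectTimeoutMs(conf, DATANODE_KEY));
        System.out.println(TASKTRACKER_KEY + " = " + connectTimeoutMs(conf, TASKTRACKER_KEY));
    }
}
```

A production cluster would leave the keys unset and get the long default, while a test setup would set a short value for each daemon it starts.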



[jira] Commented: (HADOOP-4724) TaskTracker, DataNode, and SecondaryNameNode should timeout on waiting for its server to be up

Posted by "Tsz Wo (Nicholas), SZE (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12650701#action_12650701 ] 

Tsz Wo (Nicholas), SZE commented on HADOOP-4724:
------------------------------------------------

+1 for this change 

Currently, waitForProxy(...) waits forever if it keeps getting ConnectException. That does not seem right: if the server on the other side is down, the client cannot detect it and just keeps waiting.
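
As a rough illustration of the behaviour being asked for, the sketch below retries on ConnectException until a deadline passes and then gives up. It is not the actual waitForProxy code: the class BoundedProxyWait, the Connector interface, and the method signature are all hypothetical names invented for this example.

```java
import java.io.IOException;
import java.net.ConnectException;
import java.util.concurrent.TimeUnit;

public class BoundedProxyWait {
    /** Hypothetical stand-in for "try to obtain a proxy to the server". */
    public interface Connector<T> {
        T connect() throws IOException;
    }

    /**
     * Retries connector.connect() while it throws ConnectException,
     * sleeping retryIntervalMs between attempts, and gives up with an
     * IOException once timeoutMs has elapsed. Other IOExceptions are
     * not retried and propagate immediately.
     */
    public static <T> T waitForServer(Connector<T> connector, long timeoutMs, long retryIntervalMs)
            throws IOException, InterruptedException {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
        while (true) {
            try {
                return connector.connect();
            } catch (ConnectException e) {
                long remainingNanos = deadline - System.nanoTime();
                if (remainingNanos <= 0) {
                    throw new IOException("Server not up after " + timeoutMs + " ms", e);
                }
                // Never sleep past the deadline.
                long remainingMs = TimeUnit.NANOSECONDS.toMillis(remainingNanos) + 1;
                Thread.sleep(Math.min(retryIntervalMs, remainingMs));
            }
        }
    }

    public static void main(String[] args) throws Exception {
        final int[] attempts = {0};
        // Simulated server that refuses the first two connection attempts.
        String proxy = waitForServer(() -> {
            if (attempts[0]++ < 2) {
                throw new ConnectException("connection refused");
            }
            return "proxy";
        }, 5000, 100);
        System.out.println("Got " + proxy + " after " + attempts[0] + " attempts");
    }
}
```

With a very large timeoutMs the loop behaves much like the current wait-forever code, which is why a long default keeps things backwards compatible.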



[jira] Commented: (HADOOP-4724) TaskTracker, DataNode, and SecondaryNameNode should timeout on waiting for its server to be up

Posted by "Steve Loughran (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12650958#action_12650958 ] 

Steve Loughran commented on HADOOP-4724:
----------------------------------------

+1 for this. It's useful for development and easy to test (start these nodes with nothing to bond to).

* What property names to use?
* What are reasonable defaults for production systems? 

For minidfs we can run with a configuration that times out much faster; the default timeout should be adequate for people setting up basic clusters on real or virtual machines, without any assumptions about NTP working :)



[jira] Commented: (HADOOP-4724) TaskTracker, DataNode, and SecondaryNameNode should timeout on waiting for its server to be up

Posted by "Hairong Kuang (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12651195#action_12651195 ] 

Hairong Kuang commented on HADOOP-4724:
---------------------------------------

>  What property names to use?
Should we have one property per kind of IPC client? Something like datanode.connect.timeout, tasktracker.connect.timeout, dfsclient.connect.timeout...

>  What are reasonable defaults for production systems?
I am thinking of starting with a large number like 1 hour or 1 day. It is at least backwards compatible.
