Posted to common-dev@hadoop.apache.org by "Steve Loughran (JIRA)" <ji...@apache.org> on 2008/07/31 19:46:33 UTC

[jira] Created: (HADOOP-3881) IPC client doesnt time out if far end handler hangs

IPC client doesnt time out if far end handler hangs
---------------------------------------------------

                 Key: HADOOP-3881
                 URL: https://issues.apache.org/jira/browse/HADOOP-3881
             Project: Hadoop Core
          Issue Type: Bug
          Components: ipc
            Reporter: Steve Loughran
            Priority: Minor


This is what appears to be happening in some changes of mine that (inadvertently) blocked the JobTracker: if the client can connect to the far end and invoke an operation, the far end has forever to deal with the request, and the client blocks too.

Clearly the far end shouldn't do this; it's a serious problem to address. But should the client hang? Should it not time out after some specifiable interval and signal that the far end isn't processing requests in a timely manner?

(Marked as minor as this shouldn't arise in day-to-day operation, but it should be easy to create a mock object to simulate the hang, and timeouts are generally considered useful in an IPC mechanism.)
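The behaviour being asked for can be sketched in plain Java: wrap the blocking invocation in a Future and give it a deadline, so a hung far end surfaces as a TimeoutException rather than an indefinite block. This is only an illustration, not Hadoop's actual IPC code; the class and method names are invented for the example.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

public class TimedInvocation {
    // Run a blocking call with a deadline instead of waiting forever on the
    // far end. On timeout the caller sees a TimeoutException, not a hang.
    static <T> T callWithTimeout(Callable<T> call, long timeoutMs) throws Exception {
        ExecutorService exec = Executors.newSingleThreadExecutor();
        try {
            Future<T> future = exec.submit(call);
            return future.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (ExecutionException e) {
            // unwrap the callable's own exception
            throw (Exception) e.getCause();
        } finally {
            exec.shutdownNow(); // interrupt the worker if it is still blocked
        }
    }
}
```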

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-3881) IPC client doesnt time out if far end handler hangs

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12619046#action_12619046 ] 

Doug Cutting commented on HADOOP-3881:
--------------------------------------

> a bit of jitter is needed [ ... ]

There is jitter in block reports, and in ExponentialBackoffRetry. I have not heard of folks having problems on cluster restart.

> Also, maybe the IPC and design decisions could be documented in the wiki

The problem with detailed code documentation kept separate from the code is that it quickly goes stale; the internal design is dynamic. Good in-code documentation is better for this, since it is more naturally maintained as the code changes. Separate documentation is best reserved for slower-moving targets such as end-user API documentation and high-level architectural documentation.




[jira] Commented: (HADOOP-3881) IPC client doesnt time out if far end handler hangs

Posted by "Steve Loughran (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12618956#action_12618956 ] 

Steve Loughran commented on HADOOP-3881:
----------------------------------------

Yes, retries in this situation would not be ideal. Throwing an exception such as "timeout invoking InterTracker.heartbeat() on /127.0.0.1/8083 - possible deadlock" would be enough for developers. In production, though, this situation shouldn't show up. Shall I close this issue as INVALID?

If retry load is an issue, then the whole set of client retry operations in the TaskTracker and DataNode needs to be looked at. There's a sleep, with the sleep time hard-coded in the source, which means that if the whole datacentre is synchronized - as you get if the power gets toggled and every machine boots up at the same time - there's a risk that all the nodes in the datacentre will hit the tracker/namenode simultaneously. Even exponential backoff doesn't work if the clocks are fully synchronized; it helps, but a bit of jitter is needed too, just to round things off. There's enough complexity/duplication here that this could be pushed into a reusable class.
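A minimal sketch of what that reusable class might look like - exponential backoff capped at a maximum, with uniform random jitter so that nodes booting in lockstep spread their retries out. The constants and names here are invented for illustration, not taken from Hadoop's source.

```java
import java.util.Random;

public class BackoffWithJitter {
    // Compute a retry delay: exponential growth capped at a maximum, then
    // "full jitter" - a uniform pick in [0, delay) - so synchronized clients
    // do not all retry at the same instant.
    static long sleepMillis(int attempt, Random rng) {
        long base = 1_000L;   // illustrative initial delay
        long cap = 60_000L;   // illustrative upper bound on delay
        long exp = Math.min(cap, base << Math.min(attempt, 16));
        return (long) (rng.nextDouble() * exp);
    }
}
```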

Also, maybe the IPC mechanism and its design decisions could be documented in the wiki.




[jira] Commented: (HADOOP-3881) IPC client doesnt time out if far end handler hangs

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12618786#action_12618786 ] 

Doug Cutting commented on HADOOP-3881:
--------------------------------------

This is by design. The client used to time out, but when servers got slow and clients retried, it led to server meltdown. Think of it another way: if the call were local, not remote, should it time out? At the RPC layer nothing exceptional has happened: both ends are still up, responding to pings, etc. So I don't see the current behavior as obviously wrong. If the service hangs, and clients depend on that service, then clients will hang too.

That said, there may be a case with non-singleton services where one service could hang and the client might reasonably retry against a different one. That doesn't currently apply much to Hadoop. I can find only two protocols where clients talk to peers rather than superiors (InterDatanodeProtocol and ClientDatanodeProtocol). In these cases, if the remote end were to hang indefinitely, it's not clear what the client should do. It would probably be bad, so perhaps we should add a way to time things out in these specific cases.




[jira] Commented: (HADOOP-3881) IPC client doesnt time out if far end handler hangs

Posted by "Raghu Angadi (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12619069#action_12619069 ] 

Raghu Angadi commented on HADOOP-3881:
--------------------------------------

> If retry load is an issue then the whole client retry operations in [...]

The current implementation makes retry less relevant, and not needed in most cases, right? One of the main motivations was to handle bursty load gracefully at the server. BlockReports do an exponential backoff right now, but it is not required and does not get triggered with the current IPC implementation. It needs to be removed.

We could add a FAQ entry in the wiki briefly stating that IPC calls don't time out.
