You are viewing a plain text version of this content. The canonical link for it is here.
Posted to common-issues@hadoop.apache.org by "Ray Chiang (JIRA)" <ji...@apache.org> on 2015/04/06 23:40:14 UTC

[jira] [Commented] (HADOOP-11252) RPC client write does not time out by default

    [ https://issues.apache.org/jira/browse/HADOOP-11252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14482010#comment-14482010 ] 

Ray Chiang commented on HADOOP-11252:
-------------------------------------

+1 (non-binding).  The code changes look fine to me.  As an easy test of its effect, I'm able to cause problems on my cluster by setting the value to 1ms.  Reasonable values run fine for me.

I'd like to see the new property properly documented.  From what I can see, core-default.xml looks like the right place.

> RPC client write does not time out by default
> ---------------------------------------------
>
>                 Key: HADOOP-11252
>                 URL: https://issues.apache.org/jira/browse/HADOOP-11252
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: ipc
>    Affects Versions: 2.5.0
>            Reporter: Wilfred Spiegelenburg
>            Assignee: Wilfred Spiegelenburg
>            Priority: Critical
>         Attachments: HADOOP-11252.patch
>
>
> The RPC client has a default timeout set to 0 when no timeout is passed in. This means that the network connection created will not timeout when used to write data. The issue has shown in YARN-2578 and HDFS-4858. Timeouts for writes then fall back to the tcp level retry (configured via tcp_retries2) and timeouts between the 15-30 minutes. Which is too long for a default behaviour.
> Using 0 as the default value for timeout is incorrect. We should use a sane value for the timeout and the "ipc.ping.interval" configuration value is a logical choice for it. The default behaviour should be changed from 0 to the value read for the ping interval from the Configuration.
> Fixing it in common makes more sense than finding and changing all other points in the code that do not pass in a timeout.
> Offending code lines:
> https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/RPC.java#L488
> and 
> https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/RPC.java#L350



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)