You are viewing a plain text version of this content. The canonical link for it is here.
Posted to common-issues@hadoop.apache.org by "Ming Ma (JIRA)" <ji...@apache.org> on 2014/10/31 16:29:35 UTC

[jira] [Commented] (HADOOP-11252) RPC client write does not time out by default

    [ https://issues.apache.org/jira/browse/HADOOP-11252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14191944#comment-14191944 ] 

Ming Ma commented on HADOOP-11252:
----------------------------------

Agree with Wilfred that we should fix it in hadoop-common. YARN-2714 is another example. While  HDFS-4858 has addressed the issue between DN and NN, we still have issue between client and NN.

1. Have a client machine make a RPC request to NN.
2. Power off NN while NN is processing the RPC.
3. The client machine can detect connection issue in around 16 minutes. 

> RPC client write does not time out by default
> ---------------------------------------------
>
>                 Key: HADOOP-11252
>                 URL: https://issues.apache.org/jira/browse/HADOOP-11252
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: ipc
>    Affects Versions: 2.5.0
>            Reporter: Wilfred Spiegelenburg
>
> The RPC client has a default timeout set to 0 when no timeout is passed in. This means that the network connection created will not timeout when used to write data. The issue has shown in YARN-2578 and HDFS-4858. Timeouts for writes then fall back to the tcp level retry (configured via tcp_retries2) and timeouts between the 15-30 minutes. Which is too long for a default behaviour.
> Using 0 as the default value for timeout is incorrect. We should use a sane value for the timeout and the "ipc.ping.interval" configuration value is a logical choice for it. The default behaviour should be changed from 0 to the value read for the ping interval from the Configuration.
> Fixing it in common makes more sense than finding and changing all other points in the code that do not pass in a timeout.
> Offending code lines:
> https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/RPC.java#L488
> and 
> https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/RPC.java#L350



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)