Posted to issues@kudu.apache.org by "Mostafa Mokhtar (JIRA)" <ji...@apache.org> on 2018/01/20 00:34:13 UTC

[jira] [Commented] (KUDU-2192) KRPC should have a timer to close stuck connections

    [ https://issues.apache.org/jira/browse/KUDU-2192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16333051#comment-16333051 ] 

Mostafa Mokhtar commented on KUDU-2192:
---------------------------------------

[~kwho] [~sailesh] [~hubert.sun]

I tried partitioning the network between two backends with KRPC enabled, by applying this rule on 10.00.000.28:

sudo /sbin/iptables -I INPUT -s 10.00.000.29 -j DROP

The query then failed within 30 minutes with the error below:

Query Status: TransmitData() to 10.00.000.28:27000 failed: Network error: recv error: Connection timed out (error 110)

Thrift failed in a similar way, but in 15 minutes.

> KRPC should have a timer to close stuck connections
> ---------------------------------------------------
>
>                 Key: KUDU-2192
>                 URL: https://issues.apache.org/jira/browse/KUDU-2192
>             Project: Kudu
>          Issue Type: Improvement
>          Components: rpc
>            Reporter: Michael Ho
>            Priority: Major
>
> If the remote host goes down or its network gets unplugged, all pending RPCs to that host will be stuck if there is no timeout specified. While those RPCs which have finished sending their payloads or those which haven't started sending payloads can be cancelled quickly, those in mid-transmission (i.e. an RPC at the front of the outbound queue with part of its payload sent already) cannot be cancelled until the payload has been completely sent. Therefore, it's beneficial to have a timeout to kill a connection if it's not making any progress for an extended period of time so the RPC will fail and get unstuck. The timeout may need to be conservatively large to avoid aggressive closing of connections due to transient network issues. One can consider augmenting the existing maintenance thread logic which checks for idle connections to check for this kind of timeout. Please feel free to propose other alternatives (e.g. TCP keepalive timeout) in this JIRA.
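
The "no progress" check described above could look roughly like the sketch below. This is an illustration only, not Kudu's KRPC code: the Connection class, its fields, and close_stuck_connections() are hypothetical names, and the caller is assumed to be a periodic maintenance pass that supplies the current monotonic time.

```python
from dataclasses import dataclass


@dataclass
class Connection:
    """Hypothetical stand-in for one outbound RPC connection."""
    name: str
    pending_bytes: int = 0      # bytes queued but not yet written to the socket
    last_progress: float = 0.0  # time (seconds) of the last successful write
    closed: bool = False

    def on_bytes_sent(self, n: int, now: float) -> None:
        # Any successful write counts as progress and resets the stall clock.
        self.pending_bytes = max(0, self.pending_bytes - n)
        self.last_progress = now

    def close(self) -> None:
        self.closed = True


def close_stuck_connections(conns, now, stall_timeout_s):
    """Close connections that have unsent data but made no progress
    for at least stall_timeout_s seconds. Returns the names closed."""
    closed = []
    for c in conns:
        if c.closed or c.pending_bytes == 0:
            continue  # nothing in flight; leave it to the idle-connection check
        if now - c.last_progress >= stall_timeout_s:
            c.close()
            closed.append(c.name)
    return closed
```

A connection with nothing queued is skipped on purpose, so this check only targets transfers stalled in mid-transmission. The timeout value is left to the caller because the description asks for it to be conservatively large.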



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)