Posted to issues@spark.apache.org by "Mridul Muralidharan (JIRA)" <ji...@apache.org> on 2014/06/21 11:12:25 UTC

[jira] [Comment Edited] (SPARK-704) ConnectionManager sometimes cannot detect loss of sending connections

    [ https://issues.apache.org/jira/browse/SPARK-704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14039742#comment-14039742 ] 

Mridul Muralidharan edited comment on SPARK-704 at 6/21/14 9:10 AM:
--------------------------------------------------------------------

If the remote node goes down, the SendingConnection would be notified, since it is also registered for read events (precisely to handle this case).
The ReceivingConnection would be notified as well, since it is waiting on reads on that socket.
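
To illustrate the read-event mechanism described above, here is a minimal sketch in Python rather than Spark's own Scala/NIO code. It assumes a plain stream socket; the names `peer_closed` and `demo` are hypothetical and do not exist in Spark. The idea is that a socket used only for sending can still be registered for read events, and an end-of-stream (a zero-byte read) then signals that the remote side has gone away.

```python
import selectors
import socket


def peer_closed(sock: socket.socket, timeout: float = 1.0) -> bool:
    """Return True if a read event shows that the peer closed the connection.

    Hypothetical helper for illustration; not part of Spark.
    """
    sel = selectors.DefaultSelector()
    try:
        # Register for read events even though we only intend to send.
        sel.register(sock, selectors.EVENT_READ)
        if not sel.select(timeout):
            # No read event within the timeout: nothing is known about the peer.
            return False
        try:
            # MSG_PEEK leaves any real data in the buffer for the caller.
            data = sock.recv(1, socket.MSG_PEEK)
        except ConnectionError:
            # A reset or aborted connection also counts as a lost peer.
            return True
        # An empty read means the peer performed an orderly shutdown.
        return data == b""
    finally:
        sel.close()


def demo():
    """Check a connected pair before and after the remote end closes."""
    local, remote = socket.socketpair()
    try:
        before = peer_closed(local, timeout=0.1)
        remote.close()
        after = peer_closed(local, timeout=1.0)
        return before, after
    finally:
        local.close()
        remote.close()
```

Note that this only works when the local TCP stack actually learns of the failure; a peer that silently disappears produces no read event at all.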

This, of course, assumes that the local node detects the remote node's failure at the TCP layer.
Problems arise when the failure goes undetected because there is no activity on the socket (at either the app or the socket level: keepalive timeout, etc.).
Usually this is detected via application-level ping/keepalive messages; not sure if we want to introduce that into Spark ...
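
For reference, an application-level keepalive usually boils down to tracking when each peer was last heard from and treating prolonged silence as failure. The following is a rough sketch in Python under that assumption. `HeartbeatMonitor`, its methods, and the timeout value are all hypothetical; this is not a proposal for how Spark should implement it.

```python
import time


class HeartbeatMonitor:
    """Tracks when each peer was last heard from.

    Hypothetical helper for illustration; not part of Spark.
    """

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s  # silence longer than this marks a peer dead
        self.clock = clock          # injectable clock, handy for testing
        self.last_seen = {}

    def record(self, peer):
        """Call whenever a ping (or any other message) arrives from peer."""
        self.last_seen[peer] = self.clock()

    def dead_peers(self):
        """Return the peers that have been silent for longer than the timeout."""
        now = self.clock()
        return sorted(
            peer for peer, seen in self.last_seen.items()
            if now - seen > self.timeout_s
        )
```

A short usage example with a fake clock:

```python
now = [0.0]
monitor = HeartbeatMonitor(timeout_s=5.0, clock=lambda: now[0])
monitor.record("node-a")      # heard from node-a at t=0
now[0] = 3.0
monitor.record("node-b")      # heard from node-b at t=3
now[0] = 6.0
print(monitor.dead_peers())   # ['node-a']
```

The trade-off is extra traffic plus a timeout to tune: too short gives false positives under GC pauses or load, too long delays failure detection.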


was (Author: mridulm80):
If remote node goes down, SendingConnection would be notified since it is also registered for read events (to handle precisely this case actually).
ReceivingConnection would anyway be notified since it is waiting on reads on that socket.

This, of course, assumes that the local node detects the remote node's failure at the TCP layer.
Problems come in when 

> ConnectionManager sometimes cannot detect loss of sending connections
> ---------------------------------------------------------------------
>
>                 Key: SPARK-704
>                 URL: https://issues.apache.org/jira/browse/SPARK-704
>             Project: Spark
>          Issue Type: Bug
>            Reporter: Charles Reiss
>            Assignee: Henry Saputra
>
> ConnectionManager currently does not detect when SendingConnections disconnect except if it is trying to send through them. As a result, a node failure just after a connection is initiated but before any acknowledgement messages can be sent may result in a hang.
> ConnectionManager has code intended to detect this case by detecting the failure of a corresponding ReceivingConnection, but this code assumes that the remote host:port of the ReceivingConnection is the same as the ConnectionManagerId, which is almost never true. Additionally, there does not appear to be any reason to assume a corresponding ReceivingConnection will exist.


