Posted to issues@spark.apache.org by "Patrick Wendell (JIRA)" <ji...@apache.org> on 2014/10/09 03:58:33 UTC

[jira] [Commented] (SPARK-3736) Workers should reconnect to Master if disconnected

    [ https://issues.apache.org/jira/browse/SPARK-3736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14164578#comment-14164578 ] 

Patrick Wendell commented on SPARK-3736:
----------------------------------------

I spoke a bit offline with [~ilikerps] about this. I think the solution here is pretty simple: if the worker disconnects, it should just try to re-initialize the connection to all drivers. It might need some slight refactoring so that on reconnect it keeps retrying for an infinite number of attempts, plus some checking to make sure there aren't races.

A good first step would be to get a grasp of how the general fault tolerance code works around connections (there is a bit of complexity here because of failover between masters). Check out the documentation on the Spark website about standalone fault tolerance. Right now the worker will simply hang out and do nothing when it loses the connection to the master, because it's expecting another master to re-connect to it. But this won't occur in the case where there is master failure.
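
To make the retry idea concrete, here is a rough sketch of a "keep trying at a fixed interval" loop. It is written in Python purely for illustration and is not Spark's worker code: the function name reconnect_until_registered, the try_register callback, and the 10-second default (borrowed from the interval the issue description attributes to Hadoop) are all assumptions. It also does nothing about the races mentioned above; it only shows the retry shape.

```python
import time
from typing import Callable, Optional


def reconnect_until_registered(
    try_register: Callable[[], bool],
    interval_seconds: float = 10.0,
    max_attempts: Optional[int] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call try_register until it succeeds; return the number of attempts.

    try_register is a hypothetical callback that returns True once the
    worker is registered with a master. With max_attempts=None the loop
    retries indefinitely.
    """
    attempts = 0
    while max_attempts is None or attempts < max_attempts:
        attempts += 1
        try:
            if try_register():
                return attempts
        except OSError:
            # Connection failures (ConnectionError is a subclass of OSError)
            # count as a failed attempt; fall through and retry.
            pass
        # Wait a fixed interval before the next attempt.
        sleep(interval_seconds)
    raise TimeoutError(f"not registered after {attempts} attempts")
```

Passing the sleep function in as a parameter is only there so the loop can be exercised without actually waiting; a real worker would more likely schedule the retry on its own event loop rather than block a thread.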

> Workers should reconnect to Master if disconnected
> --------------------------------------------------
>
>                 Key: SPARK-3736
>                 URL: https://issues.apache.org/jira/browse/SPARK-3736
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.0.2, 1.1.0
>            Reporter: Andrew Ash
>            Priority: Critical
>
> In standalone mode, when a worker gets disconnected from the master for some reason it never attempts to reconnect.  In this situation you have to bounce the worker before it will reconnect to the master.
> The preferred alternative is to follow what Hadoop does -- when there's a disconnect, attempt to reconnect at a particular interval until successful (I think it repeats indefinitely every 10sec).
> This has been observed by:
> - [~pkolaczk] in http://apache-spark-user-list.1001560.n3.nabble.com/Workers-disconnected-from-master-sometimes-and-never-reconnect-back-td6240.html
> - [~romi-totango] in http://apache-spark-user-list.1001560.n3.nabble.com/Re-Workers-disconnected-from-master-sometimes-and-never-reconnect-back-td15335.html
> - [~aash]


