You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by mccheah <gi...@git.apache.org> on 2014/10/16 20:20:05 UTC

[GitHub] spark pull request: [SPARK-3736] Workers reconnect when disassocia...

GitHub user mccheah opened a pull request:

    https://github.com/apache/spark/pull/2828

    [SPARK-3736] Workers reconnect when disassociated from the master.

    Before, if the master node is killed and restarted, the worker nodes
    would not attempt to reconnect to the Master. Therefore, when the Master
    node was restarted, the worker nodes needed to be restarted as well.
    
    Now, when the Master node is disconnected, the worker nodes will
    continuously ping the master node in attempts to reconnect to it. Once
    the master node restarts, it will detect one of the registration
    requests from its former workers. The result is that the cluster
    re-enters a healthy state.
    
    In addition, when the master does not receive a heartbeat from the
    worker, the worker was removed; however, when the worker sent a
    heartbeat to the master, the master used to ignore the heartbeat. Now,
    a master that receives a heartbeat from a worker that had been
    disconnected will request the worker to re-attempt the registration
    process, at which point the worker will send a RegisterWorker request
    and be re-connected accordingly.
    
    Re-connection attempts per worker are submitted every N seconds, where N
    is configured by the property spark.worker.reconnect.interval - this has
    a default of 60 seconds right now.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/mccheah/spark reconnect-dead-workers

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/2828.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #2828
    
----
commit b5b34af964199af296e12490413225f55d93a6cd
Author: mcheah <mc...@palantir.com>
Date:   2014-10-15T22:27:21Z

    [SPARK-3736] Workers reconnect when disassociated from the master.
    
    Before, if the master node is killed and restarted, the worker nodes
    would not attempt to reconnect to the Master. Therefore, when the Master
    node was restarted, the worker nodes needed to be restarted as well.
    
    Now, when the Master node is disconnected, the worker nodes will
    continuously ping the master node in attempts to reconnect to it. Once
    the master node restarts, it will detect one of the registration
    requests from its former workers. The result is that the cluster
    re-enters a healthy state.
    
    In addition, when the master does not receive a heartbeat from the
    worker, the worker was removed; however, when the worker sent a
    heartbeat to the master, the master used to ignore the heartbeat. Now,
    a master that receives a heartbeat from a worker that had been
    disconnected will request the worker to re-attempt the registration
    process, at which point the worker will send a RegisterWorker request
    and be re-connected accordingly.
    
    Re-connection attempts per worker are submitted every N seconds, where N
    is configured by the property spark.worker.reconnect.interval - this has
    a default of 60 seconds right now.

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3736] Workers reconnect when disassocia...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2828#issuecomment-59439367
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21822/consoleFull) for   PR 2828 at commit [`94ddeca`](https://github.com/apache/spark/commit/94ddeca5390c5746b767b99ab8086d651e474978).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `  case class ReconnectWorker(masterUrl: String) extends DeployMessage`



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3736] Workers reconnect when disassocia...

Posted by CodingCat <gi...@git.apache.org>.
Github user CodingCat commented on the pull request:

    https://github.com/apache/spark/pull/2828#issuecomment-59814026
  
    sure, I created the JIRA: https://issues.apache.org/jira/browse/SPARK-4011



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3736] Workers reconnect when disassocia...

Posted by CodingCat <gi...@git.apache.org>.
Github user CodingCat commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2828#discussion_r18986488
  
    --- Diff: core/src/main/scala/org/apache/spark/deploy/worker/Worker.scala ---
    @@ -362,9 +372,19 @@ private[spark] class Worker(
         }
       }
     
    -  def masterDisconnected() {
    +  private def masterDisconnected() {
         logError("Connection to master failed! Waiting for master to reconnect...")
         connected = false
    +    scheduleAttemptsToReconnectToMaster()
    +  }
    +
    +  private def scheduleAttemptsToReconnectToMaster() {
    --- End diff --
    
    I see....then I will vote to do something different with Hadoop by reusing registrationRetryTimer....otherwise the inconsistency of the logic in the two *similar* code blocks makes the program a bit fishy.... 
    



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3736] Workers reconnect when disassocia...

Posted by mccheah <gi...@git.apache.org>.
Github user mccheah commented on the pull request:

    https://github.com/apache/spark/pull/2828#issuecomment-59408043
  
    One remark is that there are no automated tests in this commit for now.
    
    I was unsuccessful in setting up TestKit to emulate a worker and master sending messages to each other. I also have not seen any other unit tests that test message passing.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3736] Workers reconnect when disassocia...

Posted by ash211 <gi...@git.apache.org>.
Github user ash211 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2828#discussion_r18976148
  
    --- Diff: core/src/main/scala/org/apache/spark/deploy/master/Master.scala ---
    @@ -341,7 +341,11 @@ private[spark] class Master(
             case Some(workerInfo) =>
               workerInfo.lastHeartbeat = System.currentTimeMillis()
             case None =>
    -          logWarning("Got heartbeat from unregistered worker " + workerId)
    +          if (workers.map(_.id).contains(workerId)) {
    --- End diff --
    
    Should this have a not in it?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3736] Workers reconnect when disassocia...

Posted by markhamstra <gi...@git.apache.org>.
Github user markhamstra commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2828#discussion_r18985880
  
    --- Diff: core/src/main/scala/org/apache/spark/deploy/worker/Worker.scala ---
    @@ -243,6 +249,10 @@ private[spark] class Worker(
             System.exit(1)
           }
     
    +    case ReconnectWorker(masterUrl) =>
    +      logWarning(s"Master with url $masterUrl requested this worker to reconnect.")
    --- End diff --
    
    Why Warning?  Seems more natural to me for the disconnect/failure to reply to be a WARN, but the subsequent reconnect request and related actions to just be INFO-level events.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3736] Workers reconnect when disassocia...

Posted by mccheah <gi...@git.apache.org>.
Github user mccheah commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2828#discussion_r18986941
  
    --- Diff: core/src/main/scala/org/apache/spark/deploy/worker/Worker.scala ---
    @@ -362,9 +372,19 @@ private[spark] class Worker(
         }
       }
     
    -  def masterDisconnected() {
    +  private def masterDisconnected() {
         logError("Connection to master failed! Waiting for master to reconnect...")
         connected = false
    +    scheduleAttemptsToReconnectToMaster()
    +  }
    +
    +  private def scheduleAttemptsToReconnectToMaster() {
    --- End diff --
    
    One other option is to have the logic factored out into a separate block that passes in an optional number of times to retry, and set the option to None in reconnect and set the option to the appropriate number on initial startup.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3736] Workers reconnect when disassocia...

Posted by ash211 <gi...@git.apache.org>.
Github user ash211 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2828#discussion_r19002923
  
    --- Diff: core/src/main/scala/org/apache/spark/deploy/worker/Worker.scala ---
    @@ -362,9 +372,19 @@ private[spark] class Worker(
         }
       }
     
    -  def masterDisconnected() {
    +  private def masterDisconnected() {
         logError("Connection to master failed! Waiting for master to reconnect...")
         connected = false
    +    scheduleAttemptsToReconnectToMaster()
    +  }
    +
    +  private def scheduleAttemptsToReconnectToMaster() {
    --- End diff --
    
    I dug into Hadoop source and actually found out that the default policy for Hadoop reconnects is to retry every 10 seconds for 6 attempts, and then every 60 seconds for 10 attempts.  Each attempt also has a fuzz factor applied of [0.5t, 1.5t] to prevent a thundering herd of reconnect attempts across the cluster.
    
    I don't have a strong opinion on infinite vs ~10min of retries -- I'd vote for following Hadoop's lead unless presented with compelling arguments to do something different.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3736] Workers reconnect when disassocia...

Posted by CodingCat <gi...@git.apache.org>.
Github user CodingCat commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2828#discussion_r18988140
  
    --- Diff: core/src/main/scala/org/apache/spark/deploy/worker/Worker.scala ---
    @@ -362,9 +372,19 @@ private[spark] class Worker(
         }
       }
     
    -  def masterDisconnected() {
    +  private def masterDisconnected() {
         logError("Connection to master failed! Waiting for master to reconnect...")
         connected = false
    +    scheduleAttemptsToReconnectToMaster()
    +  }
    +
    +  private def scheduleAttemptsToReconnectToMaster() {
    --- End diff --
    
    not sure about the motivation of that Hadoop let tasktracker retries forever.....might be different with our case


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3736] Workers reconnect when disassocia...

Posted by andrewor14 <gi...@git.apache.org>.
Github user andrewor14 commented on the pull request:

    https://github.com/apache/spark/pull/2828#issuecomment-59425292
  
    add to whitelist


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3736] Workers reconnect when disassocia...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2828#issuecomment-59435763
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21818/consoleFull) for   PR 2828 at commit [`a698e35`](https://github.com/apache/spark/commit/a698e356b05129ef2e7b9fadd73a1f2d9184c5a0).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `  case class ReconnectWorker(masterUrl: String) extends DeployMessage`



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3736] Workers reconnect when disassocia...

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on the pull request:

    https://github.com/apache/spark/pull/2828#issuecomment-59815368
  
    This looks good to me.  Thanks!  I'm going to merge this into `master`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3736] Workers reconnect when disassocia...

Posted by mccheah <gi...@git.apache.org>.
Github user mccheah commented on the pull request:

    https://github.com/apache/spark/pull/2828#issuecomment-59803881
  
    @JoshRosen agreed with @ash211, this is really good.
    
    Are there any actual comments on the PR, or can it be merged? =)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3736] Workers reconnect when disassocia...

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on the pull request:

    https://github.com/apache/spark/pull/2828#issuecomment-64316156
  
    It looks like this patch may have introduced a race-condition / bug during multi-master failover: https://issues.apache.org/jira/browse/SPARK-4592.  I'm working on a fix, but thought I'd mention the JIRA here in case any of this patch's reviewers would be interested in providing feedback.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3736] Workers reconnect when disassocia...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/2828#issuecomment-59435771
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21818/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3736] Workers reconnect when disassocia...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2828#issuecomment-59814881
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21926/consoleFull) for   PR 2828 at commit [`83f8bc9`](https://github.com/apache/spark/commit/83f8bc9a18c9d2663127f4f8142ae3f5273db2d2).
     * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3736] Workers reconnect when disassocia...

Posted by mccheah <gi...@git.apache.org>.
Github user mccheah commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2828#discussion_r18977288
  
    --- Diff: core/src/main/scala/org/apache/spark/deploy/master/Master.scala ---
    @@ -341,7 +341,11 @@ private[spark] class Master(
             case Some(workerInfo) =>
               workerInfo.lastHeartbeat = System.currentTimeMillis()
             case None =>
    -          logWarning("Got heartbeat from unregistered worker " + workerId)
    +          if (workers.map(_.id).contains(workerId)) {
    --- End diff --
    
    The above observation is correct - only workers that have previously registered with the master are allowed to reconnect. Workers that are connecting for the first time shouldn't be allowed to spawn a heartbeat and have the master send back a reconnection message.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3736] Workers reconnect when disassocia...

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on the pull request:

    https://github.com/apache/spark/pull/2828#issuecomment-64323860
  
    Andrew's got a patch for this: #3447


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3736] Workers reconnect when disassocia...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2828#issuecomment-59823485
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21926/consoleFull) for   PR 2828 at commit [`83f8bc9`](https://github.com/apache/spark/commit/83f8bc9a18c9d2663127f4f8142ae3f5273db2d2).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `  case class ReconnectWorker(masterUrl: String) extends DeployMessage`



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3736] Workers reconnect when disassocia...

Posted by andrewor14 <gi...@git.apache.org>.
Github user andrewor14 commented on the pull request:

    https://github.com/apache/spark/pull/2828#issuecomment-59833561
  
    Don't worry about it. This test is a little flaky and will be fixed shortly. I highly doubt that the test failure is caused by this PR.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3736] Workers reconnect when disassocia...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2828#issuecomment-59430399
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21822/consoleFull) for   PR 2828 at commit [`94ddeca`](https://github.com/apache/spark/commit/94ddeca5390c5746b767b99ab8086d651e474978).
     * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3736] Workers reconnect when disassocia...

Posted by CodingCat <gi...@git.apache.org>.
Github user CodingCat commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2828#discussion_r18977018
  
    --- Diff: core/src/main/scala/org/apache/spark/deploy/master/Master.scala ---
    @@ -341,7 +341,11 @@ private[spark] class Master(
             case Some(workerInfo) =>
               workerInfo.lastHeartbeat = System.currentTimeMillis()
             case None =>
    -          logWarning("Got heartbeat from unregistered worker " + workerId)
    +          if (workers.map(_.id).contains(workerId)) {
    --- End diff --
    
    @ash211 what he is trying to do seems to be that, only before we decide this worker is DEAD, we allow the reconnect


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3736] Workers reconnect when disassocia...

Posted by CodingCat <gi...@git.apache.org>.
Github user CodingCat commented on the pull request:

    https://github.com/apache/spark/pull/2828#issuecomment-59810532
  
    @markhamstra , yeah, my concern is just this, though Worker is marked as private[spark], is it a good practice to expose every detail in the implementation to the other components....?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3736] Workers reconnect when disassocia...

Posted by CodingCat <gi...@git.apache.org>.
Github user CodingCat commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2828#discussion_r19011614
  
    --- Diff: core/src/main/scala/org/apache/spark/deploy/worker/Worker.scala ---
    @@ -362,9 +372,19 @@ private[spark] class Worker(
         }
       }
     
    -  def masterDisconnected() {
    +  private def masterDisconnected() {
         logError("Connection to master failed! Waiting for master to reconnect...")
         connected = false
    +    scheduleAttemptsToReconnectToMaster()
    +  }
    +
    +  private def scheduleAttemptsToReconnectToMaster() {
    --- End diff --
    
    Thank you, @ash211 
    
    I think prolonging the interval between attempts and adding some jitter there is very reasonable....maybe in this patch, we can also change the current implementation of registrationRetryTimer?@mccheah 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3736] Workers reconnect when disassocia...

Posted by mccheah <gi...@git.apache.org>.
Github user mccheah commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2828#discussion_r18988742
  
    --- Diff: core/src/main/scala/org/apache/spark/deploy/worker/Worker.scala ---
    @@ -362,9 +372,19 @@ private[spark] class Worker(
         }
       }
     
    -  def masterDisconnected() {
    +  private def masterDisconnected() {
         logError("Connection to master failed! Waiting for master to reconnect...")
         connected = false
    +    scheduleAttemptsToReconnectToMaster()
    +  }
    +
    +  private def scheduleAttemptsToReconnectToMaster() {
    --- End diff --
    
    @ash211 ?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3736] Workers reconnect when disassocia...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2828#issuecomment-59598067
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21873/consoleFull) for   PR 2828 at commit [`fe0e02f`](https://github.com/apache/spark/commit/fe0e02feaa8ac3e01ea7e90240e46a3d5a276864).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `  case class ReconnectWorker(masterUrl: String) extends DeployMessage`



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3736] Workers reconnect when disassocia...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2828#issuecomment-59596562
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21873/consoleFull) for   PR 2828 at commit [`fe0e02f`](https://github.com/apache/spark/commit/fe0e02feaa8ac3e01ea7e90240e46a3d5a276864).
     * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3736] Workers reconnect when disassocia...

Posted by mccheah <gi...@git.apache.org>.
Github user mccheah commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2828#discussion_r18986188
  
    --- Diff: core/src/main/scala/org/apache/spark/deploy/worker/Worker.scala ---
    @@ -362,9 +372,19 @@ private[spark] class Worker(
         }
       }
     
    -  def masterDisconnected() {
    +  private def masterDisconnected() {
         logError("Connection to master failed! Waiting for master to reconnect...")
         connected = false
    +    scheduleAttemptsToReconnectToMaster()
    +  }
    +
    +  private def scheduleAttemptsToReconnectToMaster() {
    --- End diff --
    
    In that case we can't directly use registrationRetryTimer, as that explicitly kills the worker after a certain number of retries.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3736] Workers reconnect when disassocia...

Posted by ash211 <gi...@git.apache.org>.
Github user ash211 commented on the pull request:

    https://github.com/apache/spark/pull/2828#issuecomment-59639085
  
    This is EXCELLENT work @JoshRosen !  Looking forward to future integration tests that cover these sorts of behaviors.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3736] Workers reconnect when disassocia...

Posted by markhamstra <gi...@git.apache.org>.
Github user markhamstra commented on the pull request:

    https://github.com/apache/spark/pull/2828#issuecomment-59810037
  
    @CodingCat, Worker is private[spark], so what is the nature of your concern?  In fact, I'm wondering whether we really want the changes in this PR that make some methods inaccessible from the rest of spark.  I haven't looked at the accessibility of Worker's methods in detail to say for certain what the correct modifier should be in each case; but if we want to change them, that's a refactoring that can and should be addressed in another PR.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3736] Workers reconnect when disassocia...

Posted by mccheah <gi...@git.apache.org>.
Github user mccheah commented on the pull request:

    https://github.com/apache/spark/pull/2828#issuecomment-59824518
  
    The PR doesn't seem to be related to the unit tests that failed. How shall we tackle this issue?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3736] Workers reconnect when disassocia...

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on the pull request:

    https://github.com/apache/spark/pull/2828#issuecomment-59812443
  
    As a general principle, you should use the most private access modifiers that are sufficient.  We can always make methods / fields _more_ visible, but it's much harder to remove / change functionality once it's been exposed to other components.
    
    W.r.t. refactoring, I agree with Mark: a large-scale refactoring of access modifiers should happen in a separate PR, not here.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3736] Workers reconnect when disassocia...

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2828#discussion_r19102413
  
    --- Diff: core/src/main/scala/org/apache/spark/deploy/worker/Worker.scala ---
    @@ -166,26 +178,47 @@ private[spark] class Worker(
         }
       }
     
    -  def registerWithMaster() {
    -    tryRegisterAllMasters()
    -    var retries = 0
    -    registrationRetryTimer = Some {
    -      context.system.scheduler.schedule(REGISTRATION_TIMEOUT, REGISTRATION_TIMEOUT) {
    -        Utils.tryOrExit {
    -          retries += 1
    -          if (registered) {
    -            registrationRetryTimer.foreach(_.cancel())
    -          } else if (retries >= REGISTRATION_RETRIES) {
    -            logError("All masters are unresponsive! Giving up.")
    -            System.exit(1)
    -          } else {
    -            tryRegisterAllMasters()
    +  private def retryConnectToMaster() {
    +    logInfo("ping")
    +    Utils.tryOrExit {
    +      connectionAttemptCount += 1
    +      if (registered) {
    +        registrationRetryTimer.foreach(_.cancel())
    +        registrationRetryTimer = None
    +      } else if (connectionAttemptCount <= TOTAL_REGISTRATION_RETRIES) {
    +        tryRegisterAllMasters()
    +        if (connectionAttemptCount == INITIAL_REGISTRATION_RETRIES) {
    +          registrationRetryTimer.foreach(_.cancel())
    +          registrationRetryTimer = Some {
    +            context.system.scheduler.schedule(PROLONGED_REGISTRATION_RETRY_INTERVAL,
    +              PROLONGED_REGISTRATION_RETRY_INTERVAL)(retryConnectToMaster)
               }
             }
    +      } else {
    +        logError("All masters are unresponsive! Giving up.")
    +        System.exit(1)
           }
         }
       }
     
    +  def registerWithMaster() {
    +    // DisassociatedEvent may be triggered multiple times, so don't attempt registration
    +    // if there are outstanding registration attempts scheduled.
    +    registrationRetryTimer match {
    +      case None =>
    +        registered = false
    +        tryRegisterAllMasters()
    +        connectionAttemptCount = 0
    +        registrationRetryTimer = Some {
    +          context.system.scheduler.schedule(INITIAL_REGISTRATION_RETRY_INTERVAL,
    +          INITIAL_REGISTRATION_RETRY_INTERVAL)(retryConnectToMaster)
    --- End diff --
    
    I'd indent this line by another two spaces to make it more obvious that it's a continuation of the previous line.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3736] Workers reconnect when disassocia...

Posted by CodingCat <gi...@git.apache.org>.
Github user CodingCat commented on the pull request:

    https://github.com/apache/spark/pull/2828#issuecomment-59807896
  
    @JoshRosen , this is awesome to test Spark integration with Docker
    
    @mccheah , this PR is LGTM now, except that we exposed too many should-be-private members in Worker (not your fault, existing in the current code).. not sure about the reason....@pwendell @markhamstra you have some insights about this?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3736] Workers reconnect when disassocia...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/2828#issuecomment-59406399
  
    Can one of the admins verify this patch?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3736] Workers reconnect when disassocia...

Posted by CodingCat <gi...@git.apache.org>.
Github user CodingCat commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2828#discussion_r18978412
  
    --- Diff: core/src/main/scala/org/apache/spark/deploy/worker/Worker.scala ---
    @@ -362,9 +372,19 @@ private[spark] class Worker(
         }
       }
     
    -  def masterDisconnected() {
    +  private def masterDisconnected() {
         logError("Connection to master failed! Waiting for master to reconnect...")
         connected = false
    +    scheduleAttemptsToReconnectToMaster()
    +  }
    +
    +  private def scheduleAttemptsToReconnectToMaster() {
    --- End diff --
    
    is it possible to reuse registrationRetryTimer in Worker?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3736] Workers reconnect when disassocia...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/2828#issuecomment-59598071
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21873/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3736] Workers reconnect when disassocia...

Posted by CodingCat <gi...@git.apache.org>.
Github user CodingCat commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2828#discussion_r18976036
  
    --- Diff: core/src/main/scala/org/apache/spark/deploy/worker/Worker.scala ---
    @@ -94,6 +96,7 @@ private[spark] class Worker(
       val finishedExecutors = new HashMap[String, ExecutorRunner]
       val drivers = new HashMap[String, DriverRunner]
       val finishedDrivers = new HashMap[String, DriverRunner]
    +  var scheduledReconnectMessage: Option[Cancellable] = None
    --- End diff --
    
    scheduledReconnectTask? when I looked at this variable, I expected it to be some case class representing the message itself


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3736] Workers reconnect when disassocia...

Posted by CodingCat <gi...@git.apache.org>.
Github user CodingCat commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2828#discussion_r18976594
  
    --- Diff: core/src/main/scala/org/apache/spark/deploy/worker/Worker.scala ---
    @@ -365,6 +375,16 @@ private[spark] class Worker(
       def masterDisconnected() {
    --- End diff --
    
    shall this method be private? we call it somewhere else?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3736] Workers reconnect when disassocia...

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2828#discussion_r19102299
  
    --- Diff: core/src/main/scala/org/apache/spark/deploy/worker/Worker.scala ---
    @@ -64,8 +66,17 @@ private[spark] class Worker(
       // Send a heartbeat every (heartbeat timeout) / 4 milliseconds
       val HEARTBEAT_MILLIS = conf.getLong("spark.worker.timeout", 60) * 1000 / 4
     
    -  val REGISTRATION_TIMEOUT = 20.seconds
    -  val REGISTRATION_RETRIES = 3
    +  val INITIAL_REGISTRATION_RETRIES = 6
    --- End diff --
    
    Maybe add a comment above this line to say that this is modeled after Hadoop's design.  This will help future maintainers to understand this code.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3736] Workers reconnect when disassocia...

Posted by CodingCat <gi...@git.apache.org>.
Github user CodingCat commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2828#discussion_r18985861
  
    --- Diff: core/src/main/scala/org/apache/spark/deploy/worker/Worker.scala ---
    @@ -362,9 +372,19 @@ private[spark] class Worker(
         }
       }
     
    -  def masterDisconnected() {
    +  private def masterDisconnected() {
         logError("Connection to master failed! Waiting for master to reconnect...")
         connected = false
    +    scheduleAttemptsToReconnectToMaster()
    +  }
    +
    +  private def scheduleAttemptsToReconnectToMaster() {
    --- End diff --
    
    according to @ash211 , "The preferred alternative is to follow what Hadoop does – when there's a disconnect, attempt to reconnect at a particular interval until successful (I think it repeats indefinitely every 10sec).", I think we can do the same thing...just let the thread try infinitely 
    



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3736] Workers reconnect when disassocia...

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on the pull request:

    https://github.com/apache/spark/pull/2828#issuecomment-59602394
  
    **tl;dr**: _this patch looks pretty good to me based on the testing that I've done so far.  For my own interest / fun, I'd like to find a way to extend my test coverage to include the "worker-initiated reconnect" and "master restart" cases, but my tests shouldn't necessarily block the merging / review of this patch._
    
    To summarize my understanding of the failure scenarios that this PR addresses:
    
    - A worker becomes disassociated from the master, due to either:
      - The master failing and restarting
      - A transient network issue
    
    These scenarios are similar but there's one subtle distinction.  In the first scenario, the master forgets all previously-registered workers.  In the second scenario, the master can remember that a worker was previously-registered even though it may now be disassociated.
    
    In some of these scenarios, a disconnection may be reflected at the master, worker, or both (perhaps at different times).  For example, a master might deregister a worker if it has not received Spark-level heartbeats from it, or a worker might disassociate from a master due to the Akka failure detector being triggered.
     
    After this PR, there are two paths that can lead to a worker reconnection:
    
    - A master (which stayed alive) receives a heartbeat from previously-registered but now de-registered worker and asks that worker to reconnect.
    - A worker discovers that it has become disassociated from the master and attempts to initiate a reconnection.
    
    I've been working on building a Docker-based integration testing framework for testing these sorts of Spark Standalone fault-tolerance issues (to hopefully be released publicly sometime soon).
    
    I thought it would be interesting to test the "master stays alive but deregisters workers due to not receiving heartbeats" case by simulating network issues.  In my testing framework, I added a [Jepsen](https://github.com/aphyr/jepsen)-inspired network fault-injector which updates `iptables` rules in a boot2docker VM in order to temporarily break network links.  Here's the actual code that I wrote to test this PR:
    
    ```scala
    test("workers should reconnect to master if disconnected due to transient network issues") {
      // Regression test for SPARK-3736
      val env = Seq(
        "SPARK_MASTER_OPTS" -> "-Dspark.worker.timeout=2",
        "SPARK_WORKER_OPTS" -> "-Dspark.worker.timeout=2 -Dspark.akka.timeout=1 -Dspark.akka.failure-detector.threshold=1 -Dspark.akka.heartbeat.interval=1"
      )
      cluster = SparkClusters.createStandaloneCluster(env, numWorkers = 1)
      val master = cluster.masters.head
      val worker = cluster.workers.head
      master.getState.liveWorkerIPs.size should be (1)
      println("Cluster launched with one worker")
    
      networkFaultInjector.dropTraffic(master.container, worker.container)
      networkFaultInjector.dropTraffic(worker.container, master.container)
      eventually(timeout(30 seconds), interval(1 seconds)) {
        master.getState.liveWorkerIPs.size should be (0)
      }
      println("Master shows that zero workers are registered after network connection fails")
    
      networkFaultInjector.restore()
      eventually(timeout(30 seconds), interval(1 seconds)) {
        master.getState.liveWorkerIPs.size should be (1)
      }
      println("Master shows one worker after network connection is restored")
    }
    ```
    
    While running this against the current Spark master: after I kill the network connection between the master and worker, the master more-or-less immediately times out the worker and disconnects it.  However, the worker doesn't realize that it has become deregistered from the master.  This happens because the master detects worker liveness using our own heartbeat mechanism, whereas the worker detects master liveness using Akka's failure-detection mechanisms (to see this, note that the worker's `masterDisconnected()` function is only invoked from the `DisassociatedEvent` message handler).
    
    As a result, we end up in a scenario where the master receives a heartbeat from the de-registered workers that does not realize that it has been deregistered.  This PR addresses this case by having the master explicitly ask the worker to reconnect (via the `ReconnectWorker` message).  Thanks to this mechanism, my test passes with this PR's code!
    
    I'm still working on testing the case where the worker receives a DisassociationEvent and initiates the reconnection itself.  To do this, I'll need to figure out how to configure the Akka failure detector so that it quickly fails in my testing suite.  I'll also need to add a way to query the worker to ask whether it has become disconnected from the master so that I can drop packets for long enough in order to cause a disassociation.
    
    For completeness, I should also test the case where I kill the master and bring it back up using the same hostname.  This may require a bit of extra scaffolding in my framework (which currently uses container IPs rather than hostnames that I control), but I think it's doable.
    
    That said, though, the code here seems reasonable.  Don't block on me here :smile: 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3736] Workers reconnect when disassocia...

Posted by CodingCat <gi...@git.apache.org>.
Github user CodingCat commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2828#discussion_r18988031
  
    --- Diff: core/src/main/scala/org/apache/spark/deploy/worker/Worker.scala ---
    @@ -362,9 +372,19 @@ private[spark] class Worker(
         }
       }
     
    -  def masterDisconnected() {
    +  private def masterDisconnected() {
         logError("Connection to master failed! Waiting for master to reconnect...")
         connected = false
    +    scheduleAttemptsToReconnectToMaster()
    +  }
    +
    +  private def scheduleAttemptsToReconnectToMaster() {
    --- End diff --
    
    hmmm.....now, I think exit after several retries might be better, 
    
    In your case, without restarting the worker after the restarting master may bring some problems, especially when the user didn't set RECOVERY_MODE,  all application information is lost, for instance, the application whose resource requirement hasn't been filled will not be served anymore....the complete system will run in a weird status, so you eventually need to restart the applications (i.e. kill executors -> restart , which is equivalent to restart all workers)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3736] Workers reconnect when disassocia...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/2828


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3736] Workers reconnect when disassocia...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/2828#issuecomment-59823491
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21926/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3736] Workers reconnect when disassocia...

Posted by markhamstra <gi...@git.apache.org>.
Github user markhamstra commented on the pull request:

    https://github.com/apache/spark/pull/2828#issuecomment-59811803
  
    A legitimate concern, and certainly something that could be worked up into a JIRA issue and separate pull request.  But it's not a very pressing issue since nothing is in the public API, and a larger refactoring of Worker shouldn't be conflated with this PR.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3736] Workers reconnect when disassocia...

Posted by mccheah <gi...@git.apache.org>.
Github user mccheah commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2828#discussion_r18986702
  
    --- Diff: core/src/main/scala/org/apache/spark/deploy/worker/Worker.scala ---
    @@ -362,9 +372,19 @@ private[spark] class Worker(
         }
       }
     
    -  def masterDisconnected() {
    +  private def masterDisconnected() {
         logError("Connection to master failed! Waiting for master to reconnect...")
         connected = false
    +    scheduleAttemptsToReconnectToMaster()
    +  }
    +
    +  private def scheduleAttemptsToReconnectToMaster() {
    --- End diff --
    
    I'm okay with leaving it retrying indefinitely. The user may not notice the error until much later, and then reboot the master. If the workers decide to stop trying, the user will need to bounce the workers as well.
    
    I agree with having the logic being very similar is a bit of a pain, but these are really two different scenarios, so I could foresee such nearly-duplicated logic being justified in either direction.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3736] Workers reconnect when disassocia...

Posted by mccheah <gi...@git.apache.org>.
Github user mccheah commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2828#discussion_r18978981
  
    --- Diff: core/src/main/scala/org/apache/spark/deploy/worker/Worker.scala ---
    @@ -362,9 +372,19 @@ private[spark] class Worker(
         }
       }
     
    -  def masterDisconnected() {
    +  private def masterDisconnected() {
         logError("Connection to master failed! Waiting for master to reconnect...")
         connected = false
    +    scheduleAttemptsToReconnectToMaster()
    +  }
    +
    +  private def scheduleAttemptsToReconnectToMaster() {
    --- End diff --
    
    The logic would need to be refactored a bit, but it might be doable. It uses the registered flag to determine if it should stop attempts to re-register, and otherwise attempts to reconnect.
    
    If we toggle the registered flag upon disassociation as well we might be able to just call registerWithMaster(). However, I'm not sure what the implications of that are. Also, do we necessarily want the worker to give up reconnection after a certain number of retries in this case?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3736] Workers reconnect when disassocia...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/2828#issuecomment-59439376
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21822/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3736] Workers reconnect when disassocia...

Posted by CodingCat <gi...@git.apache.org>.
Github user CodingCat commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2828#discussion_r18976619
  
    --- Diff: core/src/main/scala/org/apache/spark/deploy/worker/Worker.scala ---
    @@ -365,6 +375,16 @@ private[spark] class Worker(
       def masterDisconnected() {
         logError("Connection to master failed! Waiting for master to reconnect...")
         connected = false
    +    scheduleAttemptsToReconnectToMaster()
    +  }
    +
    +  def scheduleAttemptsToReconnectToMaster() {
    --- End diff --
    
    private?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3736] Workers reconnect when disassocia...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2828#issuecomment-59425990
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21818/consoleFull) for   PR 2828 at commit [`a698e35`](https://github.com/apache/spark/commit/a698e356b05129ef2e7b9fadd73a1f2d9184c5a0).
     * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3736] Workers reconnect when disassocia...

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2828#discussion_r19101917
  
    --- Diff: core/src/main/scala/org/apache/spark/deploy/worker/Worker.scala ---
    @@ -166,26 +178,47 @@ private[spark] class Worker(
         }
       }
     
    -  def registerWithMaster() {
    -    tryRegisterAllMasters()
    -    var retries = 0
    -    registrationRetryTimer = Some {
    -      context.system.scheduler.schedule(REGISTRATION_TIMEOUT, REGISTRATION_TIMEOUT) {
    -        Utils.tryOrExit {
    -          retries += 1
    -          if (registered) {
    -            registrationRetryTimer.foreach(_.cancel())
    -          } else if (retries >= REGISTRATION_RETRIES) {
    -            logError("All masters are unresponsive! Giving up.")
    -            System.exit(1)
    -          } else {
    -            tryRegisterAllMasters()
    +  private def retryConnectToMaster() {
    +    logInfo("ping")
    --- End diff --
    
    This log message could be more informative.  I'd say something like
    
    ```scala
    logInfo(s"Attempting to connect to master (attempt # $connectionAttemptCount)")
    ```
    
    I'd also move this into the `Utils.tryOrExit` block so that we print the incremented `connectionAttemptCount`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org