You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by liyezhang556520 <gi...@git.apache.org> on 2014/12/29 10:00:04 UTC

[GitHub] spark pull request: [SPARK-4991][CORE] Worker should reconnect to ...

GitHub user liyezhang556520 opened a pull request:

    https://github.com/apache/spark/pull/3825

    [SPARK-4991][CORE] Worker should reconnect to Master when Master actor restart

    This is a following JIRA of [SPARK-4989](https://issues.apache.org/jira/browse/SPARK-4991). when Master akka actor encounter an exception, the Master will restart (akka actor restart not JVM restart). And all old information are cleared on Master (including workers, applications, etc). However, the workers are not aware of this at all. The state of the cluster is that: the master is on, and all workers are also on, but master is not aware of the exists of workers, and will ignore all worker's heartbeat because all workers are not registered. So that the whole cluster is not available.
    
    In this PR, master will tell worker the connection is disconnected, so that worker will register to master again. 

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/liyezhang556520/spark workerReconn

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/3825.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #3825
    
----
commit 107e5c58fdbe143fe6eabcfdb5d91d7b1184bb35
Author: Zhang, Liye <li...@intel.com>
Date:   2014-12-29T07:35:45Z

    worker reconnect to master when master restart for exception

commit e9c99e3969f6e058e46d65575d796d1289351318
Author: Zhang, Liye <li...@intel.com>
Date:   2014-12-29T08:51:50Z

    add log info

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-4991][CORE] Worker should reconnect to ...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/3825#issuecomment-68246474
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24859/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-4991][CORE] Worker should reconnect to ...

Posted by JoshRosen <gi...@git.apache.org>.

Github user JoshRosen commented on the pull request:

    https://github.com/apache/spark/pull/3825#issuecomment-68429462
  
    I'd rather use fewer Akka features than more, since this will make it easier to replace Akka with our own RPC layer in the future.  Therefore, I'd much prefer to not allow exceptions to trigger actor restarts / state clearing.  I think that adding an experimental Akka feature like persistence would be a huge risk for little obvious gain.
    
    I'm not sure if the "heartbeat from unknown worker" can ever occur if we don't clear the master's state because I think that workers only begin sending heartbeats once a master has ack'd their registration in which case the master would know that it was a previously-registered worker and instruct it to reconnect.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-4991][CORE] Worker should reconnect to ...

Posted by JoshRosen <gi...@git.apache.org>.

Github user JoshRosen commented on the pull request:

    https://github.com/apache/spark/pull/3825#issuecomment-68286943
  
    More specifically, I guess I'm suggesting that we modify wrap the `receive` and `receiveWithLogging` methods of our actors with try-catch blocks to log any exceptions that bubble up to the top of the actors.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-4991][CORE] Worker should reconnect to ...

Posted by markhamstra <gi...@git.apache.org>.

Github user markhamstra commented on the pull request:

    https://github.com/apache/spark/pull/3825#issuecomment-68445280
  
    It doesn't seem to me that usage of the newer Akka persistence API is called for, but it does seem that wrapping the `receive` in a try-catch is trying to do the job for which Akka's `SupervisorStrategy` is intended.  I can't recommend the hand-rolled try-catch approach.
    
    http://doc.akka.io/docs/akka/2.3.4/general/supervision.html#supervision


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-4991][CORE] Worker should reconnect to ...

Posted by liyezhang556520 <gi...@git.apache.org>.

Github user liyezhang556520 commented on the pull request:

    https://github.com/apache/spark/pull/3825#issuecomment-68684212
  
    @JoshRosen , If we want to use the supervision mechanism. We need to add another actor level as parent of the current Master actor. I don't know if that is suitable.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-4991][CORE] Worker should reconnect to ...

Posted by liyezhang556520 <gi...@git.apache.org>.

Github user liyezhang556520 commented on the pull request:

    https://github.com/apache/spark/pull/3825#issuecomment-68424069
  
    Hi @JoshRosen , it'll be a little weird to wrap the `receive` or `receiveWithLogging` method with try-catch blocks. And also this is conflict with the fault tolerance design of Akka, which is supposed to handle exception within supervisor. If you think it's fine to add a try-catch block to `receive` method. Then I'll make an additional commit.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-4991][CORE] Worker should reconnect to ...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/3825#issuecomment-68246470
  
      [Test build #24859 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24859/consoleFull) for   PR 3825 at commit [`e9c99e3`](https://github.com/apache/spark/commit/e9c99e3969f6e058e46d65575d796d1289351318).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `  case class MasterDisconnected(masterUrl: String) extends DeployMessage`



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-4991][CORE] Worker should reconnect to ...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/3825#issuecomment-68241590
  
      [Test build #24859 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24859/consoleFull) for   PR 3825 at commit [`e9c99e3`](https://github.com/apache/spark/commit/e9c99e3969f6e058e46d65575d796d1289351318).
     * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-4991][CORE] Worker should reconnect to ...

Posted by liyezhang556520 <gi...@git.apache.org>.

Github user liyezhang556520 commented on the pull request:

    https://github.com/apache/spark/pull/3825#issuecomment-68428197
  
    I think a better way is to use the Akka's [persistence](http://doc.akka.io/docs/akka/snapshot/scala/persistence.html) feature, recover the actor's state when actor restart. 
    
    Anyway, still this PR has it's value that when an unregistered worker has heartbeat to this Master, this master should not just ignore it, at least should tell the worker it has disconnected. What do you think @JoshRosen ?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-4991][CORE] Worker should reconnect to ...

Posted by markhamstra <gi...@git.apache.org>.

Github user markhamstra commented on the pull request:

    https://github.com/apache/spark/pull/3825#issuecomment-68721665
  
    @liyezhang556520 That's been done already in the DAGScheduler.  If we need another level of supervision for Master or other actors, we should consider whether these actors need separate Supervisors or whether they can be combined in the supervision hierarchy.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-4991][CORE] Worker should reconnect to ...

Posted by liyezhang556520 <gi...@git.apache.org>.

Github user liyezhang556520 commented on the pull request:

    https://github.com/apache/spark/pull/3825#issuecomment-68828355
  
    @markhamstra , thanks for reminder, I'll update this PR by making a try to introduce the supervision.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-4991][CORE] Worker should reconnect to ...

Posted by JoshRosen <gi...@git.apache.org>.

Github user JoshRosen commented on the pull request:

    https://github.com/apache/spark/pull/3825#issuecomment-68648457
  
    @markhamstra From that page:
    
    > Depending on the nature of the work to be supervised and the nature of the failure, the supervisor has a choice of the following four options:
    > 
    > 1. Resume the subordinate, keeping its accumulated internal state
    > 2. Restart the subordinate, clearing out its accumulated internal state
    > 3. Stop the subordinate permanently
    > 4. Escalate the failure, thereby failing itself
    
    It sounds like we want approach 1, resuming the subordinate without losing its state, so I'd be in favor of reworking this PR to use that type of supervision strategy.  I don't think that extending our actors to support restart necessarily makes sense, since it adds a lot of complexity (for instance, I don't think that this PR handles loss of the `appInfo` or `driverInfo` data structures).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-4991][CORE] Worker should reconnect to ...

Posted by JoshRosen <gi...@git.apache.org>.

Github user JoshRosen commented on the pull request:

    https://github.com/apache/spark/pull/3825#issuecomment-68284859
  
    Wouldn't it be better to ensure that actors like Master and DAGScheduler never die due to uncaught exceptions?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-4991][CORE] Worker should reconnect to ...

Posted by markhamstra <gi...@git.apache.org>.

Github user markhamstra commented on the pull request:

https://github.com/apache/spark/pull/3825#issuecomment-68650323

@JoshRosen your thinking is that Master will be in good shape even though an exception has been thrown? If you can guarantee that, then resuming the actor while keeping the accumulated state should do the job. Otherwise, things get more complicated. Within the lengthy process of handling exceptions thrown within the DAGScheduler (https://github.com/apache/spark/pull/186), we ended up taking the conservative approach of restarting the whole system instead of trying to restart the DAGScheduler actor with fixed or reconstructed state. I haven't dug into the details of this PR yet, so I can't say for certain, but there are probably lessons to be learned from that DAGScheduler epic PR.

Something else that we'll need to consider at some point if other actors start requiring supervision strategies other than the default is what the overall structure of the supervision hierarchy should be. Right now, only the DAGScheduler has another level of supervision, but perhaps Spark actors from outside the DAGScheduler should also be handled under one or more levels of common supervision.

---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org