Posted to reviews@spark.apache.org by lianhuiwang <gi...@git.apache.org> on 2015/02/04 13:35:23 UTC

[GitHub] spark pull request: [SPARK-5529][Core]Replace blockManager's timeo...

GitHub user lianhuiwang opened a pull request:

    https://github.com/apache/spark/pull/4367

    [SPARK-5529][Core]Replace blockManager's timeoutChecking with executor's timeoutChecking

    The problem:
    When a blockManagerSlave times out, BlockManagerMasterActor removes that blockManager, but the executor hosting it is not treated as timed out because akka's heartbeat is still normal.
    Because the blockManager lives inside the executor, removing the blockManager should also remove its executor.
    This matters especially when dynamicAllocation is enabled: allocationManager listens for onBlockManagerRemoved and removes the executor, but the executor is still present in executorDataMap in CoarseGrainedSchedulerBackend.
    
    So I think we can remove the timeout checking from BlockManagerMasterActor and add executor timeout checking to HeartbeatReceiver.
    If an executor times out in HeartbeatReceiver:
    First, we tell TaskSchedulerImpl to call executorLost; TaskSchedulerImpl tells dagScheduler executorLost, and dagScheduler then tells blockManagerMaster to remove the BlockManager of this executor.
    Next, we tell CoarseGrainedSchedulerBackend to kill the timed-out executor through the SparkContext.killExecutor API.
    In the future, if we remove akka and implement our own RPC, we only need to replace akka; the timeout checking in HeartbeatReceiver can be kept for the new RPC.
    Maybe we should rename "spark.storage.blockManagerSlaveTimeoutMs" to "spark.executor.slaveTimeoutMs" and "spark.storage.blockManagerTimeoutIntervalMs" to "spark.executor.timeoutIntervalMs"?
    @rxin @tdas @sryza @andrewor14 
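
The timeout check described in this pull request can be sketched without akka or Spark. This is a minimal illustration, not the code from the patch: `ExecutorTimeoutTracker`, its `onExpired` callback, and the explicit `nowMs` arguments are hypothetical names introduced here so that the logic can be run in isolation. Unlike the diff quoted later in this thread, the sketch also forgets an executor once it has expired, which is an assumption about the desired behaviour.

```scala
import scala.collection.mutable

// Hypothetical helper, not part of Spark: remembers when each executor last
// sent a heartbeat and reports executors that have been silent for too long.
final class ExecutorTimeoutTracker(timeoutMs: Long, onExpired: String => Unit) {
  private val executorLastSeen = mutable.HashMap.empty[String, Long]

  // Record a heartbeat from the given executor at time nowMs.
  def heartbeatReceived(executorId: String, nowMs: Long): Unit =
    executorLastSeen(executorId) = nowMs

  // Expire every executor whose last heartbeat is older than timeoutMs.
  // Returns the expired executor ids in sorted order.
  def expireDeadHosts(nowMs: Long): Seq[String] = {
    val minSeenTime = nowMs - timeoutMs
    val expired = executorLastSeen
      .filter { case (_, lastSeenMs) => lastSeenMs < minSeenTime }
      .keys.toSeq.sorted
    expired.foreach { id =>
      executorLastSeen -= id // assumption: an expired executor is forgotten
      onExpired(id)          // stand-in for executorLost / killExecutor
    }
    expired
  }
}

object ExecutorTimeoutSketch {
  def main(args: Array[String]): Unit = {
    val tracker = new ExecutorTimeoutTracker(
      120000L, id => println("executor lost: " + id))
    tracker.heartbeatReceived("1", 0L)
    tracker.heartbeatReceived("2", 100000L)
    // At t = 130000 ms only executor "1" has been silent for more than 120000 ms.
    println(tracker.expireDeadHosts(130000L).mkString(","))
  }
}
```

Passing the clock value in explicitly keeps the sketch deterministic; an actor-based version would read the current time and trigger `expireDeadHosts` from a periodic message instead.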

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/lianhuiwang/spark SPARK-5529

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/4367.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #4367
    
----
commit aeb74b02a5521185c2cb571388b577a7af4e8da9
Author: lianhuiwang <li...@gmail.com>
Date:   2015-02-04T12:27:33Z

    Replace blockManager's timeoutChecking of BlockManagerMasterActor with executor's timeoutChecking of HeartbeatReceiver

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5529][Core]Replace blockManager's timeo...

Posted by lianhuiwang <gi...@git.apache.org>.
Github user lianhuiwang closed the pull request at:

    https://github.com/apache/spark/pull/4367




[GitHub] spark pull request: [SPARK-5529][Core]Replace blockManager's timeo...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4367#issuecomment-72847591
  
      [Test build #26749 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26749/consoleFull) for   PR 4367 at commit [`aeb74b0`](https://github.com/apache/spark/commit/aeb74b02a5521185c2cb571388b577a7af4e8da9).
     * This patch merges cleanly.




[GitHub] spark pull request: [SPARK-5529][Core]Replace blockManager's timeo...

Posted by andrewor14 <gi...@git.apache.org>.
Github user andrewor14 commented on the pull request:

    https://github.com/apache/spark/pull/4367#issuecomment-73611756
  
    I'm going to link a fairly similar patch that aims to solve the same issue here: #4363. I believe the approach there is the same, and for the same reason given there I think we should close this PR for now.




[GitHub] spark pull request: [SPARK-5529][Core]Replace blockManager's timeo...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4367#issuecomment-72860569
  
      [Test build #26752 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26752/consoleFull) for   PR 4367 at commit [`0686355`](https://github.com/apache/spark/commit/06863550103cbee75cd0d16f6e2b8b07e32ac7b8).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-5529][Core]Replace blockManager's timeo...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/4367#issuecomment-72860578
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26752/
    Test PASSed.




[GitHub] spark pull request: [SPARK-5529][Core]Replace blockManager's timeo...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/4367#issuecomment-72847793
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26749/
    Test FAILed.




[GitHub] spark pull request: [SPARK-5529][Core]Replace blockManager's timeo...

Posted by andrewor14 <gi...@git.apache.org>.
Github user andrewor14 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4367#discussion_r24375405
  
    --- Diff: core/src/main/scala/org/apache/spark/HeartbeatReceiver.scala ---
    @@ -32,18 +36,64 @@ private[spark] case class Heartbeat(
         taskMetrics: Array[(Long, TaskMetrics)], // taskId -> TaskMetrics
         blockManagerId: BlockManagerId)
     
    +private[spark] case object ExpireDeadHosts
    +
     private[spark] case class HeartbeatResponse(reregisterBlockManager: Boolean)
     
     /**
      * Lives in the driver to receive heartbeats from executors..
      */
    -private[spark] class HeartbeatReceiver(scheduler: TaskScheduler)
    +private[spark] class HeartbeatReceiver(sc: SparkContext, scheduler: TaskScheduler)
       extends Actor with ActorLogReceive with Logging {
     
    +  val executorLastSeen = new mutable.HashMap[String, Long]
    +
    +  val slaveTimeout = sc.conf.getLong("spark.storage.blockManagerSlaveTimeoutMs",
    +    math.max(sc.conf.getInt("spark.executor.heartbeatInterval", 10000) * 3, 120000))
    +
    +  val checkTimeoutInterval = sc.conf.getLong("spark.storage.blockManagerTimeoutIntervalMs", 60000)
    +
    +  var timeoutCheckingTask: Cancellable = null
    +
    +  override def preStart() {
    +    import context.dispatcher
    +    timeoutCheckingTask = context.system.scheduler.schedule(0.seconds,
    +      checkTimeoutInterval.milliseconds, self, ExpireDeadHosts)
    +    super.preStart()
    +  }
    +
       override def receiveWithLogging = {
         case Heartbeat(executorId, taskMetrics, blockManagerId) =>
    +      heartbeatReceived(executorId)
           val response = HeartbeatResponse(
             !scheduler.executorHeartbeatReceived(executorId, taskMetrics, blockManagerId))
           sender ! response
    +    case ExpireDeadHosts =>
    +      expireDeadHosts()
    +  }
    +
    +  private def heartbeatReceived(executorId: String) = {
    +    executorLastSeen(executorId) = System.currentTimeMillis()
    +  }
    +
    +  private def expireDeadHosts() {
    +    logTrace("Checking for hosts with no recent heart beats in HeartbeatReceiver.")
    +    val now = System.currentTimeMillis()
    +    val minSeenTime = now - slaveTimeout
    +    for ((executorId, lastSeenMs) <- executorLastSeen) {
    +      if (lastSeenMs < minSeenTime) {
    +        logWarning("Removing Executor " + executorId + " with no recent heart beats: "
    +          + (now - lastSeenMs) + " ms exceeds " + slaveTimeout + "ms")
    +        scheduler.executorLost(executorId, SlaveLost())
    +        sc.killExecutor(executorId)
    --- End diff --
    
    This will fail an assertion in standalone / Mesos mode.
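
For reference, the default timeout computed in the diff above can be worked through in isolation. This is a small sketch of the arithmetic only: `SlaveTimeoutDefault` and `defaultSlaveTimeoutMs` are hypothetical names, and the sketch takes the heartbeat interval as a plain argument instead of reading "spark.executor.heartbeatInterval" from the configuration.

```scala
// Hypothetical helper mirroring the default in the diff:
// math.max(heartbeatInterval * 3, 120000)
object SlaveTimeoutDefault {
  // Lower bound on the timeout, in milliseconds, as used in the diff.
  val MinTimeoutMs: Long = 120000L

  // Three heartbeat intervals, but never less than MinTimeoutMs.
  def defaultSlaveTimeoutMs(heartbeatIntervalMs: Long): Long =
    math.max(heartbeatIntervalMs * 3, MinTimeoutMs)

  def main(args: Array[String]): Unit = {
    // With the diff's fallback interval of 10000 ms, 3 * 10000 = 30000,
    // which is below the floor, so the floor of 120000 ms applies.
    println(defaultSlaveTimeoutMs(10000L))
    // With a 60000 ms interval, 3 * 60000 = 180000 exceeds the floor.
    println(defaultSlaveTimeoutMs(60000L))
  }
}
```

With the fallback heartbeat interval the floor decides the result, so the executor is only expired after roughly twelve missed heartbeats rather than three.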




[GitHub] spark pull request: [SPARK-5529][Core]Replace blockManager's timeo...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4367#issuecomment-72847791
  
      [Test build #26749 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26749/consoleFull) for   PR 4367 at commit [`aeb74b0`](https://github.com/apache/spark/commit/aeb74b02a5521185c2cb571388b577a7af4e8da9).
     * This patch **fails to build**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-5529][Core]Replace blockManager's timeo...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4367#issuecomment-72850605
  
      [Test build #26752 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26752/consoleFull) for   PR 4367 at commit [`0686355`](https://github.com/apache/spark/commit/06863550103cbee75cd0d16f6e2b8b07e32ac7b8).
     * This patch merges cleanly.




[GitHub] spark pull request: [SPARK-5529][Core]Replace blockManager's timeo...

Posted by lianhuiwang <gi...@git.apache.org>.
Github user lianhuiwang commented on the pull request:

    https://github.com/apache/spark/pull/4367#issuecomment-73625959
  
    Yes, I will close this PR. Thanks, all.

