You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by zsxwing <gi...@git.apache.org> on 2014/12/24 06:35:52 UTC

[GitHub] spark pull request: [SPARK-4951][Core] Fix the issue that a busy e...

GitHub user zsxwing opened a pull request:

    https://github.com/apache/spark/pull/3783

    [SPARK-4951][Core] Fix the issue that a busy executor may be killed

    Three changes to fix this issue:
    
    1. Handle the case that receiving `SparkListenerTaskStart` before `SparkListenerBlockManagerAdded`.
    2. Don't add `executorId` to `removeTimes` when the executor is busy.
    3. Use `HashMap.retain` to safely traverse the HashMap and remove items.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/zsxwing/spark SPARK-4951

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/3783.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #3783
    
----
commit d5c615d04bcfa00a37a0dc678a2bd702246bbd1e
Author: zsxwing <zs...@gmail.com>
Date:   2014-12-24T04:11:19Z

    Fix the issue that a busy executor may be killed

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4951][Core] Fix the issue that a busy e...

Posted by andrewor14 <gi...@git.apache.org>.
Github user andrewor14 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3783#discussion_r22760840
  
    --- Diff: core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala ---
    @@ -478,12 +500,21 @@ private[spark] class ExecutorAllocationManager(
         /**
          * An estimate of the total number of pending tasks remaining for currently running stages. Does
          * not account for tasks which may have failed and been resubmitted.
    +     *
    +     * Note: This is not thread-safe without the caller owning the `allocationManager` lock.
          */
         def totalPendingTasks(): Int = {
           stageIdToNumTasks.map { case (stageId, numTasks) =>
             numTasks - stageIdToTaskIndices.get(stageId).map(_.size).getOrElse(0)
           }.sum
         }
    +
    +    /**
    +     * Note: This is not thread-safe without the caller owning the `allocationManager` lock.
    +     */
    --- End diff --
    
    This isn't what the function does. Even though it's obvious we should keep the java docs descriptive
    ```
    Return true if an executor is not currently running a task, and false otherwise.
    Note: ...
    ```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4951][Core] Fix the issue that a busy e...

Posted by zsxwing <gi...@git.apache.org>.
Github user zsxwing commented on the pull request:

    https://github.com/apache/spark/pull/3783#issuecomment-68082164
  
    cc @andrewor14


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4951][Core] Fix the issue that a busy e...

Posted by zsxwing <gi...@git.apache.org>.
Github user zsxwing commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3783#discussion_r22698300
  
    --- Diff: core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala ---
    @@ -426,39 +426,44 @@ private[spark] class ExecutorAllocationManager(
           }
         }
     
    -    override def onTaskStart(taskStart: SparkListenerTaskStart): Unit = synchronized {
    +    override def onTaskStart(taskStart: SparkListenerTaskStart): Unit = {
           val stageId = taskStart.stageId
           val taskId = taskStart.taskInfo.taskId
           val taskIndex = taskStart.taskInfo.index
           val executorId = taskStart.taskInfo.executorId
     
    -      // If this is the last pending task, mark the scheduler queue as empty
    -      stageIdToTaskIndices.getOrElseUpdate(stageId, new mutable.HashSet[Int]) += taskIndex
    -      val numTasksScheduled = stageIdToTaskIndices(stageId).size
    -      val numTasksTotal = stageIdToNumTasks.getOrElse(stageId, -1)
    -      if (numTasksScheduled == numTasksTotal) {
    -        // No more pending tasks for this stage
    -        stageIdToNumTasks -= stageId
    -        if (stageIdToNumTasks.isEmpty) {
    -          allocationManager.onSchedulerQueueEmpty()
    +      allocationManager.synchronized {
    +        allocationManager.onExecutorAdded(executorId)
    --- End diff --
    
    > the `SparkListenerTaskStart` event is posted after the `SparkListenerBlockManagerAdded` event
    
    I think you mean `before` instead of `after`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4951][Core] Fix the issue that a busy e...

Posted by zsxwing <gi...@git.apache.org>.
Github user zsxwing commented on the pull request:

    https://github.com/apache/spark/pull/3783#issuecomment-69481186
  
    Added comments.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4951][Core] Fix the issue that a busy e...

Posted by zsxwing <gi...@git.apache.org>.
Github user zsxwing commented on the pull request:

    https://github.com/apache/spark/pull/3783#issuecomment-69463862
  
    > We can do this by maintaining a set of such executors, and marking these as idle as soon as new executors join, for instance.
    
    We can just use `ExecutorAllocationListener.executorIdToTaskIds` to know if a executor is idle. See my new commit.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4951][Core] Fix the issue that a busy e...

Posted by andrewor14 <gi...@git.apache.org>.
Github user andrewor14 commented on the pull request:

    https://github.com/apache/spark/pull/3783#issuecomment-69519502
  
    Ok LGTM thanks for fixing this tricky issue @zsxwing. I'm merging this into master.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4951][Core] Fix the issue that a busy e...

Posted by zsxwing <gi...@git.apache.org>.
Github user zsxwing commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3783#discussion_r22698315
  
    --- Diff: core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala ---
    @@ -478,6 +486,8 @@ private[spark] class ExecutorAllocationManager(
         /**
          * An estimate of the total number of pending tasks remaining for currently running stages. Does
          * not account for tasks which may have failed and been resubmitted.
    +     *
    +     * Note: The caller must own the `allocationManager` lock before calling `totalPendingTasks`
    --- End diff --
    
    Your comment is much clearer.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4951][Core] Fix the issue that a busy e...

Posted by andrewor14 <gi...@git.apache.org>.
Github user andrewor14 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3783#discussion_r22760847
  
    --- Diff: core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala ---
    @@ -426,39 +433,49 @@ private[spark] class ExecutorAllocationManager(
           }
         }
     
    -    override def onTaskStart(taskStart: SparkListenerTaskStart): Unit = synchronized {
    +    override def onTaskStart(taskStart: SparkListenerTaskStart): Unit = {
           val stageId = taskStart.stageId
           val taskId = taskStart.taskInfo.taskId
           val taskIndex = taskStart.taskInfo.index
           val executorId = taskStart.taskInfo.executorId
     
    -      // If this is the last pending task, mark the scheduler queue as empty
    -      stageIdToTaskIndices.getOrElseUpdate(stageId, new mutable.HashSet[Int]) += taskIndex
    -      val numTasksScheduled = stageIdToTaskIndices(stageId).size
    -      val numTasksTotal = stageIdToNumTasks.getOrElse(stageId, -1)
    -      if (numTasksScheduled == numTasksTotal) {
    -        // No more pending tasks for this stage
    -        stageIdToNumTasks -= stageId
    -        if (stageIdToNumTasks.isEmpty) {
    -          allocationManager.onSchedulerQueueEmpty()
    +      allocationManager.synchronized {
    +        // This guards against the race condition in which the `SparkListenerTaskStart`
    +        // event is posted before the `SparkListenerBlockManagerAdded` event, which is
    +        // possible because these events are posted in different threads.
    --- End diff --
    
    same here


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4951][Core] Fix the issue that a busy e...

Posted by andrewor14 <gi...@git.apache.org>.
Github user andrewor14 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3783#discussion_r22668329
  
    --- Diff: core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala ---
    @@ -315,7 +314,7 @@ private[spark] class ExecutorAllocationManager(
       private def onExecutorAdded(executorId: String): Unit = synchronized {
         if (!executorIds.contains(executorId)) {
           executorIds.add(executorId)
    -      executorIds.foreach(onExecutorIdle)
    +      onExecutorIdle(executorId)
    --- End diff --
    
    > If an idle executor (call this executor X) is not removed because the minimum is reached, it won't be removed. However, its expireTime isn't changed. So after new executor joins, schedule() will remove this idle executor since it's expired.
    
    But its `expireTime` actually is already removed [here](https://github.com/apache/spark/blob/8d45834debc6986e61831d0d6e982d5528dccc51/core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala#L215), so if we don't mark it idle when a new executor joins, it will never be removed. I think there are two possible solutions here:
    
    1) When an executor is not removed because the lower bound is reached, we set the expire time of that executor to a special value (e.g. -1) instead of removing the expire time from `removeTimes`. Then, when we add a new executor, we look for this special value and mark those executors idle. This is correct because if executor X is marked busy in the mean time, its `-1` expire time will be removed from `removeTimes`.
    
    2) We maintain a set of executor IDs that are marked busy. When we add a new executor, we mark all executors that are *not* busy idle.
    
    I personally prefer (1) because it doesn't introduce a new data structure, but it adds some complexity so we need to carefully document what `-1` means. (We should declare it in the `object`).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4951][Core] Fix the issue that a busy e...

Posted by andrewor14 <gi...@git.apache.org>.
Github user andrewor14 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3783#discussion_r22669152
  
    --- Diff: core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala ---
    @@ -426,39 +426,44 @@ private[spark] class ExecutorAllocationManager(
           }
         }
     
    -    override def onTaskStart(taskStart: SparkListenerTaskStart): Unit = synchronized {
    +    override def onTaskStart(taskStart: SparkListenerTaskStart): Unit = {
           val stageId = taskStart.stageId
           val taskId = taskStart.taskInfo.taskId
           val taskIndex = taskStart.taskInfo.index
           val executorId = taskStart.taskInfo.executorId
     
    -      // If this is the last pending task, mark the scheduler queue as empty
    -      stageIdToTaskIndices.getOrElseUpdate(stageId, new mutable.HashSet[Int]) += taskIndex
    -      val numTasksScheduled = stageIdToTaskIndices(stageId).size
    -      val numTasksTotal = stageIdToNumTasks.getOrElse(stageId, -1)
    -      if (numTasksScheduled == numTasksTotal) {
    -        // No more pending tasks for this stage
    -        stageIdToNumTasks -= stageId
    -        if (stageIdToNumTasks.isEmpty) {
    -          allocationManager.onSchedulerQueueEmpty()
    +      allocationManager.synchronized {
    +        allocationManager.onExecutorAdded(executorId)
    --- End diff --
    
    then you may need to add the same check in `onBlockManagerAdded` to prevent the duplicate executor registered warning.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4951][Core] Fix the issue that a busy e...

Posted by zsxwing <gi...@git.apache.org>.
Github user zsxwing commented on the pull request:

    https://github.com/apache/spark/pull/3783#issuecomment-69130550
  
    > By the way, in the future it would be good to break this down into multiple issues and patches. 
    
    OK. 
    
    > It seems to me that only (2) actually corresponds to SPARK-4951. Let me know if this is not the case.
    
    Both (1) and (2) correspond to SPARK-4951.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4951][Core] Fix the issue that a busy e...

Posted by andrewor14 <gi...@git.apache.org>.
Github user andrewor14 commented on the pull request:

    https://github.com/apache/spark/pull/3783#issuecomment-69466165
  
    LGTM pending a few minor comment suggestions.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4951][Core] Fix the issue that a busy e...

Posted by andrewor14 <gi...@git.apache.org>.
Github user andrewor14 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3783#discussion_r22631958
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/cluster/YarnSchedulerBackend.scala ---
    @@ -110,7 +117,12 @@ private[spark] abstract class YarnSchedulerBackend(
           case r: RequestExecutors =>
             amActor match {
               case Some(actor) =>
    -            sender ! AkkaUtils.askWithReply[Boolean](r, actor, askTimeout)
    +            val driverActor = sender
    --- End diff --
    
    ah I see, we need to save it because now we use it in a future


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4951][Core] Fix the issue that a busy e...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/3783#issuecomment-69134471
  
      [Test build #25197 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25197/consoleFull) for   PR 3783 at commit [`d4c4e9a`](https://github.com/apache/spark/commit/d4c4e9a08f593de42c7950d2ca9552d42b3850c9).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4951][Core] Fix the issue that a busy e...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/3783#issuecomment-69482382
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25370/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4951][Core] Fix the issue that a busy e...

Posted by andrewor14 <gi...@git.apache.org>.
Github user andrewor14 commented on the pull request:

    https://github.com/apache/spark/pull/3783#issuecomment-69406817
  
    The other thing is what to do about executors that should be removed but were not because of the lower bound. On a high level, we should not forget that these executors were once idle. We can do this by maintaining a set of such executors, and marking these as idle as soon as new executors join, for instance. (We could also do the `-1` thing I suggested earlier, but I think that solution is slightly more complicated in terms of code.)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4951][Core] Fix the issue that a busy e...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/3783#issuecomment-69130583
  
      [Test build #25197 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25197/consoleFull) for   PR 3783 at commit [`d4c4e9a`](https://github.com/apache/spark/commit/d4c4e9a08f593de42c7950d2ca9552d42b3850c9).
     * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4951][Core] Fix the issue that a busy e...

Posted by andrewor14 <gi...@git.apache.org>.
Github user andrewor14 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3783#discussion_r22621413
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/cluster/YarnSchedulerBackend.scala ---
    @@ -110,7 +117,12 @@ private[spark] abstract class YarnSchedulerBackend(
           case r: RequestExecutors =>
             amActor match {
               case Some(actor) =>
    -            sender ! AkkaUtils.askWithReply[Boolean](r, actor, askTimeout)
    +            val driverActor = sender
    --- End diff --
    
    any reason to save this into a `val driverActor`?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4951][Core] Fix the issue that a busy e...

Posted by andrewor14 <gi...@git.apache.org>.
Github user andrewor14 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3783#discussion_r22621158
  
    --- Diff: core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala ---
    @@ -478,6 +483,8 @@ private[spark] class ExecutorAllocationManager(
         /**
          * An estimate of the total number of pending tasks remaining for currently running stages. Does
          * not account for tasks which may have failed and been resubmitted.
    +     *
    +     * Note: The current thread must own the allocationManager's monitor.
    --- End diff --
    
    Hm, what does this mean? Not sure if this comment is particularly useful


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4951][Core] Fix the issue that a busy e...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/3783#issuecomment-68037376
  
      [Test build #24772 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24772/consoleFull) for   PR 3783 at commit [`105ba3a`](https://github.com/apache/spark/commit/105ba3acea521a77122a016faa6674793d1ff696).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4951][Core] Fix the issue that a busy e...

Posted by zsxwing <gi...@git.apache.org>.
Github user zsxwing commented on the pull request:

    https://github.com/apache/spark/pull/3783#issuecomment-68084388
  
    To make ExecutorAllocationManager more robust, it should not use SparkListener, because the event delivery is unreliable.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4951][Core] Fix the issue that a busy e...

Posted by andrewor14 <gi...@git.apache.org>.
Github user andrewor14 commented on the pull request:

    https://github.com/apache/spark/pull/3783#issuecomment-69106025
  
    Hey @zsxwing thanks for fixing these. I believe points 3, 4, 5 are covered sufficiently in this PR. I still don't fully understanding (1). What is the race condition there, and which part of your patch fixes that? As for (2), I left an inline comment explaining why the original code is written as such, and why the corresponding change in this PR overlooks another issue.
    
    By the way, in the future it would be good to break this down into multiple issues and patches. It seems to me that only (2) actually corresponds to SPARK-4951. Let me know if this is not the case.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4951][Core] Fix the issue that a busy e...

Posted by andrewor14 <gi...@git.apache.org>.
Github user andrewor14 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3783#discussion_r22621233
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/cluster/YarnSchedulerBackend.scala ---
    @@ -110,7 +117,12 @@ private[spark] abstract class YarnSchedulerBackend(
           case r: RequestExecutors =>
             amActor match {
               case Some(actor) =>
    -            sender ! AkkaUtils.askWithReply[Boolean](r, actor, askTimeout)
    +            val driverActor = sender
    +            Future {
    +              driverActor ! AkkaUtils.askWithReply[Boolean](r, actor, askTimeout)
    +            } onFailure {
    +              case NonFatal(e) => logError(s"Ask ${r} unsuccessfully", e)
    --- End diff --
    
    `$r` is sufficient


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4951][Core] Fix the issue that a busy e...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/3783#issuecomment-69481097
  
      [Test build #25370 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25370/consoleFull) for   PR 3783 at commit [`d51fa0d`](https://github.com/apache/spark/commit/d51fa0dbc68f7d862a216c1245b38d2ecd29fa60).
     * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4951][Core] Fix the issue that a busy e...

Posted by andrewor14 <gi...@git.apache.org>.
Github user andrewor14 commented on the pull request:

    https://github.com/apache/spark/pull/3783#issuecomment-69465512
  
    Ah great, I didn't realize we already maintain such a set in the listener.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4951][Core] Fix the issue that a busy e...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/3783#issuecomment-68030915
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24760/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4951][Core] Fix the issue that a busy e...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/3783


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4951][Core] Fix the issue that a busy e...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/3783#issuecomment-68028107
  
      [Test build #24760 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24760/consoleFull) for   PR 3783 at commit [`d5c615d`](https://github.com/apache/spark/commit/d5c615d04bcfa00a37a0dc678a2bd702246bbd1e).
     * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4951][Core] Fix the issue that a busy e...

Posted by andrewor14 <gi...@git.apache.org>.
Github user andrewor14 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3783#discussion_r22744355
  
    --- Diff: core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala ---
    @@ -315,7 +314,7 @@ private[spark] class ExecutorAllocationManager(
       private def onExecutorAdded(executorId: String): Unit = synchronized {
         if (!executorIds.contains(executorId)) {
           executorIds.add(executorId)
    -      executorIds.foreach(onExecutorIdle)
    +      onExecutorIdle(executorId)
    --- End diff --
    
    Hm, but if that's the case we'll never remove the executor in the other cases in `removeExecutor`, i.e.
    ```
    logWarning(s"Attempted to remove unknown executor $executorId!")
    ```
    and
    ```
    logWarning(s"Attempted to remove executor $executorId " +
            s"when it is already pending to be removed!")
    ```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4951][Core] Fix the issue that a busy e...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/3783#issuecomment-69466712
  
      [Test build #25363 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25363/consoleFull) for   PR 3783 at commit [`2e365ce`](https://github.com/apache/spark/commit/2e365cefe0361319b86be6c68c1d1b8f9c324f03).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4951][Core] Fix the issue that a busy e...

Posted by zsxwing <gi...@git.apache.org>.
Github user zsxwing commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3783#discussion_r22763770
  
    --- Diff: core/src/test/scala/org/apache/spark/ExecutorAllocationManagerSuite.scala ---
    @@ -597,6 +607,41 @@ class ExecutorAllocationManagerSuite extends FunSuite with LocalSparkContext {
         assert(removeTimes(manager).size === 1)
       }
     
    +  test("call onTaskStart before onBlockManagerAdded") {
    +    sc = createSparkContext(2, 10)
    +    val manager = sc.executorAllocationManager.get
    +    assert(executorIds(manager).isEmpty)
    +    assert(removeTimes(manager).isEmpty)
    +
    +    sc.listenerBus.postToAll(SparkListenerTaskStart(0, 0, createTaskInfo(0, 0, "executor-1")))
    +    sc.listenerBus.postToAll(SparkListenerBlockManagerAdded(
    +      0L, BlockManagerId("executor-1", "host1", 1), 100L))
    +    assert(executorIds(manager).size === 1)
    +    assert(executorIds(manager).contains("executor-1"))
    +    assert(removeTimes(manager).size === 0)
    +  }
    +
    +  test("onExecutorAdded should not add a busy executor to removeTimes") {
    --- End diff --
    
    I added `SPARK-4951` to the test name.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4951][Core] Fix the issue that a busy e...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/3783#issuecomment-68032741
  
      [Test build #24772 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24772/consoleFull) for   PR 3783 at commit [`105ba3a`](https://github.com/apache/spark/commit/105ba3acea521a77122a016faa6674793d1ff696).
     * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4951][Core] Fix the issue that a busy e...

Posted by andrewor14 <gi...@git.apache.org>.
Github user andrewor14 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3783#discussion_r22668989
  
    --- Diff: core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala ---
    @@ -426,39 +426,44 @@ private[spark] class ExecutorAllocationManager(
           }
         }
     
    -    override def onTaskStart(taskStart: SparkListenerTaskStart): Unit = synchronized {
    +    override def onTaskStart(taskStart: SparkListenerTaskStart): Unit = {
           val stageId = taskStart.stageId
           val taskId = taskStart.taskInfo.taskId
           val taskIndex = taskStart.taskInfo.index
           val executorId = taskStart.taskInfo.executorId
     
    -      // If this is the last pending task, mark the scheduler queue as empty
    -      stageIdToTaskIndices.getOrElseUpdate(stageId, new mutable.HashSet[Int]) += taskIndex
    -      val numTasksScheduled = stageIdToTaskIndices(stageId).size
    -      val numTasksTotal = stageIdToNumTasks.getOrElse(stageId, -1)
    -      if (numTasksScheduled == numTasksTotal) {
    -        // No more pending tasks for this stage
    -        stageIdToNumTasks -= stageId
    -        if (stageIdToNumTasks.isEmpty) {
    -          allocationManager.onSchedulerQueueEmpty()
    +      allocationManager.synchronized {
    +        allocationManager.onExecutorAdded(executorId)
    --- End diff --
    
    This seems incorrect to me. This marks the executor idle every time it runs a task, which will result in many duplicate executor warnings (see [this line](https://github.com/apache/spark/blob/8d45834debc6986e61831d0d6e982d5528dccc51/core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala#L325)). I think we should do a check instead, and add a huge comment explaining the ordering issue:
    ```
    // This guards against the race condition in which the `SparkListenerTaskStart`
    // event is posted after the `SparkListenerBlockManagerAdded` event, which is
    // possible because these events are posted in different threads
    if (!allocationManager.executorIds.contains(executorId)) {
      allocationManager.onExecutorAdded(executorId)
    }
    ```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4951][Core] Fix the issue that a busy e...

Posted by andrewor14 <gi...@git.apache.org>.
Github user andrewor14 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3783#discussion_r22760832
  
    --- Diff: core/src/test/scala/org/apache/spark/ExecutorAllocationManagerSuite.scala ---
    @@ -597,6 +607,41 @@ class ExecutorAllocationManagerSuite extends FunSuite with LocalSparkContext {
         assert(removeTimes(manager).size === 1)
       }
     
    +  test("call onTaskStart before onBlockManagerAdded") {
    +    sc = createSparkContext(2, 10)
    +    val manager = sc.executorAllocationManager.get
    +    assert(executorIds(manager).isEmpty)
    +    assert(removeTimes(manager).isEmpty)
    +
    +    sc.listenerBus.postToAll(SparkListenerTaskStart(0, 0, createTaskInfo(0, 0, "executor-1")))
    +    sc.listenerBus.postToAll(SparkListenerBlockManagerAdded(
    +      0L, BlockManagerId("executor-1", "host1", 1), 100L))
    +    assert(executorIds(manager).size === 1)
    +    assert(executorIds(manager).contains("executor-1"))
    +    assert(removeTimes(manager).size === 0)
    +  }
    +
    +  test("onExecutorAdded should not add a busy executor to removeTimes") {
    --- End diff --
    
    Can you add
    ```
    // SPARK-4951
    ```
    on top of both of these test cases


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4951][Core] Fix the issue that a busy e...

Posted by andrewor14 <gi...@git.apache.org>.
Github user andrewor14 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3783#discussion_r22621244
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/cluster/YarnSchedulerBackend.scala ---
    @@ -119,7 +131,12 @@ private[spark] abstract class YarnSchedulerBackend(
           case k: KillExecutors =>
             amActor match {
               case Some(actor) =>
    -            sender ! AkkaUtils.askWithReply[Boolean](k, actor, askTimeout)
    +            val driverActor = sender
    +            Future {
    +              driverActor ! AkkaUtils.askWithReply[Boolean](k, actor, askTimeout)
    +            } onFailure {
    +              case NonFatal(e) => logError(s"Ask ${k} unsuccessfully", e)
    --- End diff --
    
    `$k`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4951][Core] Fix the issue that a busy e...

Posted by zsxwing <gi...@git.apache.org>.
Github user zsxwing commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3783#discussion_r22698276
  
    --- Diff: core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala ---
    @@ -315,7 +314,7 @@ private[spark] class ExecutorAllocationManager(
       private def onExecutorAdded(executorId: String): Unit = synchronized {
         if (!executorIds.contains(executorId)) {
           executorIds.add(executorId)
    -      executorIds.foreach(onExecutorIdle)
    +      onExecutorIdle(executorId)
    --- End diff --
    
    > But its expireTime actually is already removed here, so if we don't mark it idle when a new executor joins, it will never be removed.
    
    But I have changed these codes to:
    
    ```
    
        removeTimes.retain { case (executorId, expireTime) =>
          now < expireTime || !removeExecutor(executorId)
        }
    ```
    It won't remove `(executorId, expireTime)` if `removeExecutor(executorId)` return false.
    



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4951][Core] Fix the issue that a busy e...

Posted by zsxwing <gi...@git.apache.org>.
Github user zsxwing commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3783#discussion_r22631968
  
    --- Diff: core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala ---
    @@ -426,39 +426,44 @@ private[spark] class ExecutorAllocationManager(
           }
         }
     
    -    override def onTaskStart(taskStart: SparkListenerTaskStart): Unit = synchronized {
    +    override def onTaskStart(taskStart: SparkListenerTaskStart): Unit = {
           val stageId = taskStart.stageId
           val taskId = taskStart.taskInfo.taskId
           val taskIndex = taskStart.taskInfo.index
           val executorId = taskStart.taskInfo.executorId
     
    -      // If this is the last pending task, mark the scheduler queue as empty
    -      stageIdToTaskIndices.getOrElseUpdate(stageId, new mutable.HashSet[Int]) += taskIndex
    -      val numTasksScheduled = stageIdToTaskIndices(stageId).size
    -      val numTasksTotal = stageIdToNumTasks.getOrElse(stageId, -1)
    -      if (numTasksScheduled == numTasksTotal) {
    -        // No more pending tasks for this stage
    -        stageIdToNumTasks -= stageId
    -        if (stageIdToNumTasks.isEmpty) {
    -          allocationManager.onSchedulerQueueEmpty()
    +      allocationManager.synchronized {
    +        allocationManager.onExecutorAdded(executorId)
    --- End diff --
    
    `allocationManager.onExecutorAdded(executorId)` is the key to fix the order issue of `SparkListenerTaskStart` and `SparkListenerBlockManagerAdded`. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4951][Core] Fix the issue that a busy e...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/3783#issuecomment-69134474
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25197/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4951][Core] Fix the issue that a busy e...

Posted by andrewor14 <gi...@git.apache.org>.
Github user andrewor14 commented on the pull request:

    https://github.com/apache/spark/pull/3783#issuecomment-69405655
  
    Hi @zsxwing this is looking pretty close. However, there is still one thing about the current changes that I'm not sure about, as I have discussed in the comments inline.
    
    The use of `retain` in `schedule` actually changes the semantics a little. Before we would always remove `executorId` from `removeTimes` if the time expired, but now we do this only if we we actually removed the associated executor. The problem with this is that every time we call schedule we'll keep trying to remove it and failing and this will lead to a lot of noise. I think instead we should try to maintain the old semantics as follows:
    ```
    removeTimes.retain { case (executorId, expireTime) =>
      val expired = now >= expireTime
      if (expired) { removeExecutor(executorId) }
      expired
    }
    ```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4951][Core] Fix the issue that a busy e...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/3783#issuecomment-69463950
  
      [Test build #25363 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25363/consoleFull) for   PR 3783 at commit [`2e365ce`](https://github.com/apache/spark/commit/2e365cefe0361319b86be6c68c1d1b8f9c324f03).
     * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4951][Core] Fix the issue that a busy e...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/3783#issuecomment-69289480
  
      [Test build #25288 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25288/consoleFull) for   PR 3783 at commit [`49f61a9`](https://github.com/apache/spark/commit/49f61a901c14082eb391f77214aeb9456b03a8bf).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4951][Core] Fix the issue that a busy e...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/3783#issuecomment-68037381
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24772/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4951][Core] Fix the issue that a busy e...

Posted by andrewor14 <gi...@git.apache.org>.
Github user andrewor14 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3783#discussion_r22622117
  
    --- Diff: core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala ---
    @@ -373,10 +372,12 @@ private[spark] class ExecutorAllocationManager(
        * the executor is not already marked as idle.
        */
       private def onExecutorIdle(executorId: String): Unit = synchronized {
    -    if (!removeTimes.contains(executorId) && !executorsPendingToRemove.contains(executorId)) {
    -      logDebug(s"Starting idle timer for $executorId because there are no more tasks " +
    -        s"scheduled to run on the executor (to expire in $executorIdleTimeout seconds)")
    -      removeTimes(executorId) = clock.getTimeMillis + executorIdleTimeout * 1000
    +    if (executorIds.contains(executorId)) {
    +      if (!removeTimes.contains(executorId) && !executorsPendingToRemove.contains(executorId)) {
    +        logDebug(s"Starting idle timer for $executorId because there are no more tasks " +
    +          s"scheduled to run on the executor (to expire in $executorIdleTimeout seconds)")
    +        removeTimes(executorId) = clock.getTimeMillis + executorIdleTimeout * 1000
    +      }
         }
    --- End diff --
    
    can you add an else case here with
    ```
    logWarning(s"Attempted to mark unknown executor $executorId idle")
    ```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4951][Core] Fix the issue that a busy e...

Posted by zsxwing <gi...@git.apache.org>.
Github user zsxwing commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3783#discussion_r22632968
  
    --- Diff: core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala ---
    @@ -315,7 +314,7 @@ private[spark] class ExecutorAllocationManager(
       private def onExecutorAdded(executorId: String): Unit = synchronized {
         if (!executorIds.contains(executorId)) {
           executorIds.add(executorId)
    -      executorIds.foreach(onExecutorIdle)
    +      onExecutorIdle(executorId)
    --- End diff --
    
    If an idle executor (call this executor X) is not removed because the minimum is reached, it won't be removed. However, its expireTime isn't changed. So after new executor joins, `schedule()` will remove this idle executor since it's expired.
    
    Do you want to reset the `expireTime` of idle executors when a new Executor joins?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4951][Core] Fix the issue that a busy e...

Posted by andrewor14 <gi...@git.apache.org>.
Github user andrewor14 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3783#discussion_r22760861
  
    --- Diff: core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala ---
    @@ -315,7 +319,7 @@ private[spark] class ExecutorAllocationManager(
       private def onExecutorAdded(executorId: String): Unit = synchronized {
         if (!executorIds.contains(executorId)) {
           executorIds.add(executorId)
    -      executorIds.foreach(onExecutorIdle)
    +      executorIds.filter(listener.isExecutorIdle).foreach(onExecutorIdle)
    --- End diff --
    
    This deserves a loud comment, perhaps something like:
    ```
    If an executor is not removed because the lower bound has been reached, mark it idle again when new executors join because we are no longer at the lower bound. Otherwise, we may never remove these executors until after they run tasks again.
    ```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4951][Core] Fix the issue that a busy e...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/3783#issuecomment-68050039
  
      [Test build #24784 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24784/consoleFull) for   PR 3783 at commit [`05f6238`](https://github.com/apache/spark/commit/05f6238e988a54aada24ce85272212717fdc8c4e).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4951][Core] Fix the issue that a busy e...

Posted by zsxwing <gi...@git.apache.org>.
Github user zsxwing commented on the pull request:

    https://github.com/apache/spark/pull/3783#issuecomment-69284237
  
    > Can you explain how (1) is related to SPARK-4951? It seems to me that (2) is sufficient in triggering the issue.
    
    The original implementation will mark an exeuctor idle when receiving `SparkListenerBlockManagerAdded`.
    
    So if `SparkListenerTaskStart` is received before `SparkListenerBlockManagerAdded`, when receiving `SparkListenerBlockManagerAdded`, the executor will be marked idle even if there is a task running in it. Therefore, the executor will be killed when it's expired.
    
    That's why I said it's related. Of cause, we can also say (1) is a special case of (2).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4951][Core] Fix the issue that a busy e...

Posted by zsxwing <gi...@git.apache.org>.
Github user zsxwing commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3783#discussion_r22631793
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/cluster/YarnSchedulerBackend.scala ---
    @@ -110,7 +117,12 @@ private[spark] abstract class YarnSchedulerBackend(
           case r: RequestExecutors =>
             amActor match {
               case Some(actor) =>
    -            sender ! AkkaUtils.askWithReply[Boolean](r, actor, askTimeout)
    +            val driverActor = sender
    --- End diff --
    
    `sender` is a `def`. If we use it in the `Future`, because the Future will run in another thread, there is a chance that when the codes in Future starts to run, the Actor is processing other messages, the return value of `sender` will not be the original one.
    
    That's why I cache the `sender` to `driverActor` before creating the Future.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4951][Core] Fix the issue that a busy e...

Posted by andrewor14 <gi...@git.apache.org>.
Github user andrewor14 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3783#discussion_r22744863
  
    --- Diff: core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala ---
    @@ -315,7 +314,7 @@ private[spark] class ExecutorAllocationManager(
       private def onExecutorAdded(executorId: String): Unit = synchronized {
         if (!executorIds.contains(executorId)) {
           executorIds.add(executorId)
    -      executorIds.foreach(onExecutorIdle)
    +      onExecutorIdle(executorId)
    --- End diff --
    
    Wait, but if we simply don't remove `executorId` from `removeExecutor` when we've reached the minimum, we will try to remove it again as soon as the next time `schedule` is called. This means we'll be constantly calling `removeExecutor` on that executor and failing. I think we should mark it specially as "should be removed as soon as new executors join" instead. (and we can do this with the -1 thing I mentioned earlier)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4951][Core] Fix the issue that a busy e...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/3783#issuecomment-68030911
  
      [Test build #24760 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24760/consoleFull) for   PR 3783 at commit [`d5c615d`](https://github.com/apache/spark/commit/d5c615d04bcfa00a37a0dc678a2bd702246bbd1e).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4951][Core] Fix the issue that a busy e...

Posted by andrewor14 <gi...@git.apache.org>.
Github user andrewor14 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3783#discussion_r22621840
  
    --- Diff: core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala ---
    @@ -209,11 +211,8 @@ private[spark] class ExecutorAllocationManager(
           addTime += sustainedSchedulerBacklogTimeout * 1000
         }
     
    -    removeTimes.foreach { case (executorId, expireTime) =>
    -      if (now >= expireTime) {
    -        removeExecutor(executorId)
    -        removeTimes.remove(executorId)
    -      }
    +    removeTimes.retain {
    +      case (executorId, expireTime) => !(now >= expireTime && removeExecutor(executorId))
         }
    --- End diff --
    
    ```
    removeTimes.retain { case (executorId, expireTime) =>
      now < expireTime || !removeExecutor(executorId)
    }
    ```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4951][Core] Fix the issue that a busy e...

Posted by andrewor14 <gi...@git.apache.org>.
Github user andrewor14 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3783#discussion_r22621058
  
    --- Diff: core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala ---
    @@ -65,6 +65,8 @@ private[spark] class ExecutorAllocationManager(
         listenerBus: LiveListenerBus,
         conf: SparkConf)
       extends Logging {
    +allocationManager =>
    --- End diff --
    
    indent


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4951][Core] Fix the issue that a busy e...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/3783#issuecomment-68045928
  
      [Test build #24784 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24784/consoleFull) for   PR 3783 at commit [`05f6238`](https://github.com/apache/spark/commit/05f6238e988a54aada24ce85272212717fdc8c4e).
     * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4951][Core] Fix the issue that a busy e...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/3783#issuecomment-69482380
  
      [Test build #25370 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25370/consoleFull) for   PR 3783 at commit [`d51fa0d`](https://github.com/apache/spark/commit/d51fa0dbc68f7d862a216c1245b38d2ecd29fa60).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4951][Core] Fix the issue that a busy e...

Posted by andrewor14 <gi...@git.apache.org>.
Github user andrewor14 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3783#discussion_r22760845
  
    --- Diff: core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala ---
    @@ -466,7 +483,12 @@ private[spark] class ExecutorAllocationManager(
         override def onBlockManagerAdded(blockManagerAdded: SparkListenerBlockManagerAdded): Unit = {
           val executorId = blockManagerAdded.blockManagerId.executorId
           if (executorId != SparkContext.DRIVER_IDENTIFIER) {
    -        allocationManager.onExecutorAdded(executorId)
    +        // This guards against the race condition in which the `SparkListenerTaskStart`
    +        // event is posted before the `SparkListenerBlockManagerAdded` event, which is
    +        // possible because these events are posted in different threads.
    --- End diff --
    
    Can you add `(see SPARK-4951)` at the end, here and in the other place where this comment appears


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4951][Core] Fix the issue that a busy e...

Posted by andrewor14 <gi...@git.apache.org>.
Github user andrewor14 commented on the pull request:

    https://github.com/apache/spark/pull/3783#issuecomment-69222825
  
    I left a few more comments in-line.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4951][Core] Fix the issue that a busy e...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/3783#issuecomment-69466714
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25363/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4951][Core] Fix the issue that a busy e...

Posted by andrewor14 <gi...@git.apache.org>.
Github user andrewor14 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3783#discussion_r22622559
  
    --- Diff: core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala ---
    @@ -315,7 +314,7 @@ private[spark] class ExecutorAllocationManager(
       private def onExecutorAdded(executorId: String): Unit = synchronized {
         if (!executorIds.contains(executorId)) {
           executorIds.add(executorId)
    -      executorIds.foreach(onExecutorIdle)
    +      onExecutorIdle(executorId)
    --- End diff --
    
    The original rationale for marking other executors as idle is as follows. If an idle executor (call this executor X) is not removed because the minimum is reached, we need to mark it idle again when a new executor joins because we are no longer at the minimum. Otherwise, we may never removed executor X.
    
    The issue here, however, is that executor X could be running a task, in which case it is no longer idle and we shouldn't mark it as such. This means we still need to mark it as idle **only** if it is not actually running anything. Does that make sense?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4951][Core] Fix the issue that a busy e...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/3783#issuecomment-69289485
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25288/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4951][Core] Fix the issue that a busy e...

Posted by andrewor14 <gi...@git.apache.org>.
Github user andrewor14 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3783#discussion_r22744136
  
    --- Diff: core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala ---
    @@ -426,39 +426,44 @@ private[spark] class ExecutorAllocationManager(
           }
         }
     
    -    override def onTaskStart(taskStart: SparkListenerTaskStart): Unit = synchronized {
    +    override def onTaskStart(taskStart: SparkListenerTaskStart): Unit = {
           val stageId = taskStart.stageId
           val taskId = taskStart.taskInfo.taskId
           val taskIndex = taskStart.taskInfo.index
           val executorId = taskStart.taskInfo.executorId
     
    -      // If this is the last pending task, mark the scheduler queue as empty
    -      stageIdToTaskIndices.getOrElseUpdate(stageId, new mutable.HashSet[Int]) += taskIndex
    -      val numTasksScheduled = stageIdToTaskIndices(stageId).size
    -      val numTasksTotal = stageIdToNumTasks.getOrElse(stageId, -1)
    -      if (numTasksScheduled == numTasksTotal) {
    -        // No more pending tasks for this stage
    -        stageIdToNumTasks -= stageId
    -        if (stageIdToNumTasks.isEmpty) {
    -          allocationManager.onSchedulerQueueEmpty()
    +      allocationManager.synchronized {
    +        allocationManager.onExecutorAdded(executorId)
    --- End diff --
    
    yes, that's what I meant


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4951][Core] Fix the issue that a busy e...

Posted by andrewor14 <gi...@git.apache.org>.
Github user andrewor14 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3783#discussion_r22621509
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/cluster/YarnSchedulerBackend.scala ---
    @@ -97,6 +101,9 @@ private[spark] abstract class YarnSchedulerBackend(
       private class YarnSchedulerActor extends Actor {
         private var amActor: Option[ActorRef] = None
     
    +    implicit val askAmActorExecutor =
    +      ExecutionContext.fromExecutor(Utils.newDaemonCachedThreadPool("yarn-scheduler-ask-executor"))
    --- End diff --
    
    I would say `yarn-scheduler-ask-am-executor` to be more specific


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4951][Core] Fix the issue that a busy e...

Posted by andrewor14 <gi...@git.apache.org>.
Github user andrewor14 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3783#discussion_r22669098
  
    --- Diff: core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala ---
    @@ -478,6 +486,8 @@ private[spark] class ExecutorAllocationManager(
         /**
          * An estimate of the total number of pending tasks remaining for currently running stages. Does
          * not account for tasks which may have failed and been resubmitted.
    +     *
    +     * Note: The caller must own the `allocationManager` lock before calling `totalPendingTasks`
    --- End diff --
    
    This is not quite true, right? I think what you mean is more like
    ```
    Note: This is not thread-safe without the caller owning the `allocationManager` lock.
    ```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4951][Core] Fix the issue that a busy e...

Posted by zsxwing <gi...@git.apache.org>.
Github user zsxwing commented on the pull request:

    https://github.com/apache/spark/pull/3783#issuecomment-69126622
  
    For (1), because `SparkListenerTaskStart` and `SparkListenerBlockManagerAdded` are sent in different `Actor`s, Akka doesn't guarantee the order. When I tested, `SparkListenerBlockManagerAdded` is usually received later because it comes from another node.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4951][Core] Fix the issue that a busy e...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/3783#issuecomment-68050042
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24784/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4951][Core] Fix the issue that a busy e...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/3783#issuecomment-69284805
  
      [Test build #25288 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25288/consoleFull) for   PR 3783 at commit [`49f61a9`](https://github.com/apache/spark/commit/49f61a901c14082eb391f77214aeb9456b03a8bf).
     * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org