Posted to reviews@spark.apache.org by markhamstra <gi...@git.apache.org> on 2014/07/10 20:18:37 UTC

[GitHub] spark pull request: SPARK-2425 Don't kill a still-running Applicat...

GitHub user markhamstra opened a pull request:

    https://github.com/apache/spark/pull/1360

    SPARK-2425 Don't kill a still-running Application because of some misbehaving Executors 

    Introduces a LOADING -> RUNNING ExecutorState transition and prevents the Master from removing an Application that still has RUNNING Executors.
    
    Two basic changes: 1) Instead of allowing MAX_NUM_RETRY abnormal Executor exits over the entire lifetime of the Application, allow that many since the last time any Executor successfully began running the Application; 2) Don't remove the Application while the Master still thinks there are RUNNING Executors.
    
    This should be fine as long as the ApplicationInfo doesn't believe any Executors are forever RUNNING when they are not.  I expect that any Executor which stops actually running will eventually cease to be RUNNING in the Master's accounting, but another set of eyes should confirm that.  This PR also doesn't try to detect which nodes have gone rogue or to kill off bad Workers, so repeatedly failing Executors will continue to fail and fill up log files with failure reports as long as the Application keeps running.
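
    A rough sketch of the two Master-side rules above (simplified; the names follow the Master.scala code quoted in the review comments below, and the control flow is an approximation rather than the literal diff):

        // 1) Reset the retry budget whenever an Executor reports RUNNING, so that
        //    abnormal exits are only counted since the last successful start:
        if (state == ExecutorState.RUNNING) { appInfo.resetRetryCount() }

        // 2) On an abnormal exit, only give up on the Application once the retry
        //    budget is exhausted AND no Executor is still RUNNING:
        if (!normalExit && appInfo.incrementRetryCount() >= ApplicationState.MAX_NUM_RETRY) {
          if (!appInfo.executors.values.exists(_.state == ExecutorState.RUNNING)) {
            removeApplication(appInfo, ApplicationState.FAILED)
          }
        }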

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/markhamstra/spark SPARK-2425

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/1360.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1360
    
----
commit 5b85534d376d682b7e1f97f98acd532a305349f8
Author: Mark Hamstra <ma...@gmail.com>
Date:   2014-07-09T23:02:43Z

    SPARK-2425 introduce LOADING -> RUNNING ApplicationState transition
    and prevent Master from removing Application with RUNNING Executors

----



[GitHub] spark pull request: SPARK-2425 Don't kill a still-running Applicat...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1360#issuecomment-48643785
  
    QA tests have started for PR 1360. This patch merges cleanly.
    View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16508/consoleFull



[GitHub] spark pull request: SPARK-2425 Don't kill a still-running Applicat...

Posted by mridulm <gi...@git.apache.org>.
Github user mridulm commented on the pull request:

    https://github.com/apache/spark/pull/1360#issuecomment-53164487
  
    Take a look at 'spark.scheduler.executorTaskBlacklistTime' in TaskSetManager.
    Since I run mostly in yarn-cluster mode, where there is only a single application, I was not sure how relevant blacklisting actually is in your case (multiple apps via standalone, I guess?).
    
    Note that there is actually a third case which is not yet handled: slow executors/stragglers. Particularly for low-latency stages, they really hurt execution times for some of our ML jobs (a 50x speedup becomes much lower because of them).




[GitHub] spark pull request: SPARK-2425 Don't kill a still-running Applicat...

Posted by andrewor14 <gi...@git.apache.org>.
Github user andrewor14 commented on the pull request:

    https://github.com/apache/spark/pull/1360#issuecomment-54865668
  
    test this please




[GitHub] spark pull request: SPARK-2425 Don't kill a still-running Applicat...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1360#issuecomment-48656101
  
    QA results for PR 1360:
    - This patch PASSES unit tests.
    - This patch merges cleanly.
    - This patch adds no public classes.
    
    For more information see test output:
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16508/consoleFull



[GitHub] spark pull request: SPARK-2425 Don't kill a still-running Applicat...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/1360#issuecomment-48656112
  
    Merged build finished. All automated tests passed.



[GitHub] spark pull request: SPARK-2425 Don't kill a still-running Applicat...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1360#issuecomment-51960128
  
    QA tests have started for PR 1360. This patch merges cleanly.
    View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18382/consoleFull




[GitHub] spark pull request: SPARK-2425 Don't kill a still-running Applicat...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1360#issuecomment-53094602
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19085/consoleFull) for   PR 1360 at commit [`206a448`](https://github.com/apache/spark/commit/206a44853edf5caea9ce2d8ae83af3d999f5cf41).
     * This patch **passes** unit tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: SPARK-2425 Don't kill a still-running Applicat...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1360#issuecomment-48848235
  
    QA tests have started for PR 1360. This patch merges cleanly.
    View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16606/consoleFull



[GitHub] spark pull request: SPARK-2425 Don't kill a still-running Applicat...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1360#issuecomment-48850708
  
    QA results for PR 1360:
    - This patch PASSES unit tests.
    - This patch merges cleanly.
    - This patch adds no public classes.
    
    For more information see test output:
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16606/consoleFull



[GitHub] spark pull request: SPARK-2425 Don't kill a still-running Applicat...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1360#issuecomment-51967252
  
    QA results for PR 1360:
    - This patch PASSES unit tests.
    - This patch merges cleanly.
    - This patch adds no public classes.
    
    For more information see test output:
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18382/consoleFull




[GitHub] spark pull request: SPARK-2425 Don't kill a still-running Applicat...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/1360#issuecomment-48643572
  
     Merged build triggered. 



[GitHub] spark pull request: SPARK-2425 Don't kill a still-running Applicat...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/1360#issuecomment-48656115
  
    All automated tests passed.
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16508/



[GitHub] spark pull request: SPARK-2425 Don't kill a still-running Applicat...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1360#issuecomment-53087747
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19085/consoleFull) for   PR 1360 at commit [`206a448`](https://github.com/apache/spark/commit/206a44853edf5caea9ce2d8ae83af3d999f5cf41).
     * This patch merges cleanly.




[GitHub] spark pull request: SPARK-2425 Don't kill a still-running Applicat...

Posted by andrewor14 <gi...@git.apache.org>.
Github user andrewor14 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1360#discussion_r17250396
  
    --- Diff: core/src/main/scala/org/apache/spark/deploy/master/Master.scala ---
    @@ -295,28 +295,34 @@ private[spark] class Master(
           val execOption = idToApp.get(appId).flatMap(app => app.executors.get(execId))
           execOption match {
             case Some(exec) => {
    +          val appInfo = idToApp(appId)
    --- End diff --
    
    This will crash if `idToApp` doesn't already contain `appId`. Maybe we should log a warning instead of throwing a key not found exception.
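
    A minimal sketch of the safeguard being suggested here (hypothetical illustration only -- `handleExecutorStateUpdate` is a made-up stand-in for the existing update logic, and as discussed later in this thread the guarded `execOption` lookup already makes the direct `idToApp(appId)` call safe):

        idToApp.get(appId) match {
          case Some(appInfo) =>
            handleExecutorStateUpdate(appInfo)   // stand-in (hypothetical) for the real handling
          case None =>
            logWarning(s"Got executor state update for unknown application $appId")
        }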




[GitHub] spark pull request: SPARK-2425 Don't kill a still-running Applicat...

Posted by markhamstra <gi...@git.apache.org>.
Github user markhamstra commented on the pull request:

    https://github.com/apache/spark/pull/1360#issuecomment-51959659
  
    @pwendell Still should go into 1.1.0...  The change is fairly small, and the unpatched behavior is pretty nasty for long-running applications.  




[GitHub] spark pull request: SPARK-2425 Don't kill a still-running Applicat...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1360#issuecomment-54909182
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20006/consoleFull) for   PR 1360 at commit [`f099c0b`](https://github.com/apache/spark/commit/f099c0b2654759ab5cbfe2bc91cedac10f3bf77f).
     * This patch merges cleanly.




[GitHub] spark pull request: SPARK-2425 Don't kill a still-running Applicat...

Posted by andrewor14 <gi...@git.apache.org>.
Github user andrewor14 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1360#discussion_r17250527
  
    --- Diff: core/src/main/scala/org/apache/spark/deploy/master/Master.scala ---
    @@ -295,28 +295,34 @@ private[spark] class Master(
           val execOption = idToApp.get(appId).flatMap(app => app.executors.get(execId))
           execOption match {
             case Some(exec) => {
    +          val appInfo = idToApp(appId)
    --- End diff --
    
    Ah, I realize you moved this from elsewhere. It would still be good to add a safeguard.




[GitHub] spark pull request: SPARK-2425 Don't kill a still-running Applicat...

Posted by andrewor14 <gi...@git.apache.org>.
Github user andrewor14 commented on the pull request:

    https://github.com/apache/spark/pull/1360#issuecomment-54921279
  
    Thanks, I merged this.




[GitHub] spark pull request: SPARK-2425 Don't kill a still-running Applicat...

Posted by andrewor14 <gi...@git.apache.org>.
Github user andrewor14 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1360#discussion_r17250993
  
    --- Diff: core/src/main/scala/org/apache/spark/deploy/worker/Worker.scala ---
    @@ -234,7 +234,7 @@ private[spark] class Worker(
             try {
               logInfo("Asked to launch executor %s/%d for %s".format(appId, execId, appDesc.name))
               val manager = new ExecutorRunner(appId, execId, appDesc, cores_, memory_,
    -            self, workerId, host, sparkHome, workDir, akkaUrl, conf, ExecutorState.RUNNING)
    +            self, workerId, host, sparkHome, workDir, akkaUrl, conf, ExecutorState.LOADING)
    --- End diff --
    
    Is this part of fixing the bug, or just to make the logic in the existing state machine clearer?




[GitHub] spark pull request: SPARK-2425 Don't kill a still-running Applicat...

Posted by markhamstra <gi...@git.apache.org>.
Github user markhamstra commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1360#discussion_r17255465
  
    --- Diff: core/src/main/scala/org/apache/spark/deploy/master/Master.scala ---
    @@ -295,28 +295,34 @@ private[spark] class Master(
           val execOption = idToApp.get(appId).flatMap(app => app.executors.get(execId))
           execOption match {
             case Some(exec) => {
    +          val appInfo = idToApp(appId)
    --- End diff --
    
    That's not necessary.  If `idToApp` does not contain `appId`, then `execOption` will be `None` and we'll never end up in this `case Some(exec)`. 
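
    A tiny self-contained illustration of that point (stand-in types, not Spark's real classes): when the outer map lookup misses, `flatMap` short-circuits to `None`, so the `case Some(exec)` branch that dereferences `idToApp(appId)` is never entered.

        case class FakeApp(executors: Map[Int, String])   // stand-in for ApplicationInfo
        val idToApp = Map.empty[String, FakeApp]
        val execOption = idToApp.get("app-1").flatMap(app => app.executors.get(0))
        assert(execOption.isEmpty)                         // we take the None branch, never Some(exec)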




[GitHub] spark pull request: SPARK-2425 Don't kill a still-running Applicat...

Posted by markhamstra <gi...@git.apache.org>.
Github user markhamstra commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1360#discussion_r17254222
  
    --- Diff: core/src/main/scala/org/apache/spark/deploy/master/Master.scala ---
    @@ -295,28 +295,34 @@ private[spark] class Master(
           val execOption = idToApp.get(appId).flatMap(app => app.executors.get(execId))
           execOption match {
             case Some(exec) => {
    +          val appInfo = idToApp(appId)
               exec.state = state
    +          if (state == ExecutorState.RUNNING) { appInfo.resetRetryCount() }
               exec.application.driver ! ExecutorUpdated(execId, state, message, exitStatus)
               if (ExecutorState.isFinished(state)) {
    -            val appInfo = idToApp(appId)
                 // Remove this executor from the worker and app
    -            logInfo("Removing executor " + exec.fullId + " because it is " + state)
    +            logInfo(s"Removing executor ${exec.fullId} because it is $state")
                 appInfo.removeExecutor(exec)
                 exec.worker.removeExecutor(exec)
     
    -            val normalExit = exitStatus.exists(_ == 0)
    +            val normalExit = exitStatus == Some(0)
    --- End diff --
    
    Just seemed even clearer to directly compare Options instead of using a HOF.
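
    For reference, both forms compute the same Boolean; the change just trades the higher-order function for a direct Option comparison (standalone illustration):

        val exitStatus: Option[Int] = Some(0)
        val viaExists = exitStatus.exists(_ == 0)   // previous style
        val viaEquals = exitStatus == Some(0)       // style used in this PR
        assert(viaExists == viaEquals)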




[GitHub] spark pull request: SPARK-2425 Don't kill a still-running Applicat...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1360#issuecomment-54876396
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19987/consoleFull) for   PR 1360 at commit [`f099c0b`](https://github.com/apache/spark/commit/f099c0b2654759ab5cbfe2bc91cedac10f3bf77f).
     * This patch **fails** unit tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: SPARK-2425 Don't kill a still-running Applicat...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1360#issuecomment-54868231
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19987/consoleFull) for   PR 1360 at commit [`f099c0b`](https://github.com/apache/spark/commit/f099c0b2654759ab5cbfe2bc91cedac10f3bf77f).
     * This patch merges cleanly.




[GitHub] spark pull request: SPARK-2425 Don't kill a still-running Applicat...

Posted by mridulm <gi...@git.apache.org>.
Github user mridulm commented on the pull request:

    https://github.com/apache/spark/pull/1360#issuecomment-53163562
  
    @markhamstra In our cluster this usually happens because one or more executors are in a bad state: either they have insufficient disk to finish a task, or they are in the process of cleaning up and exiting.
    When a task fails, it usually gets re-assigned to the same executor where it just failed because of the locality match. This repeated reschedule-and-fail loop on the failing executor causes the application to fail, since it hits either the maximum number of failed tasks for the application or the maximum number of failures for a specific task (IIRC there are two params).
    
    We alleviate this by setting the blacklist timeout to a non-trivial, appropriate value: this prevents the rapid reschedule of a failing task on that executor (and usually some other executor picks up the task; the timeout is chosen so that this is possible).
    If the executor is healthy but can't execute this specific task, then blacklisting works fine.
    If the executor is unhealthy and going to exit, then we will still see rapid task failures until the executor notifies the master on exit, but the per-task failure count is not hit (IIRC the number of failed tasks allowed for the app is much higher than the number of failed attempts allowed per task).
    
    Of course, I am not sure whether this is completely applicable in this case.
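
    As a rough illustration of the tuning described above (the property name comes from the earlier TaskSetManager comment; the 10-second value is an arbitrary example, not a recommendation):

        import org.apache.spark.SparkConf

        // Delay rescheduling a failed task on the same executor by this many
        // milliseconds, so that another executor can usually pick it up first.
        val conf = new SparkConf()
          .set("spark.scheduler.executorTaskBlacklistTime", "10000")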




[GitHub] spark pull request: SPARK-2425 Don't kill a still-running Applicat...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1360#issuecomment-48762929
  
    QA tests have started for PR 1360. This patch merges cleanly.
    View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16571/consoleFull



[GitHub] spark pull request: SPARK-2425 Don't kill a still-running Applicat...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1360#issuecomment-50921999
  
    QA results for PR 1360:
    - This patch PASSES unit tests.
    - This patch merges cleanly.
    - This patch adds no public classes.
    
    For more information see test output:
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17686/consoleFull



[GitHub] spark pull request: SPARK-2425 Don't kill a still-running Applicat...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1360#issuecomment-50915845
  
    QA tests have started for PR 1360. This patch merges cleanly.
    View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17686/consoleFull



[GitHub] spark pull request: SPARK-2425 Don't kill a still-running Applicat...

Posted by markhamstra <gi...@git.apache.org>.
Github user markhamstra commented on the pull request:

    https://github.com/apache/spark/pull/1360#issuecomment-50281668
  
    ping
    
    Probably too late for a 1.0.2-rc, but this should go into 1.0.3 and 1.1.0.



[GitHub] spark pull request: SPARK-2425 Don't kill a still-running Applicat...

Posted by andrewor14 <gi...@git.apache.org>.
Github user andrewor14 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1360#discussion_r17250755
  
    --- Diff: core/src/main/scala/org/apache/spark/deploy/master/Master.scala ---
    @@ -295,28 +295,34 @@ private[spark] class Master(
           val execOption = idToApp.get(appId).flatMap(app => app.executors.get(execId))
           execOption match {
             case Some(exec) => {
    +          val appInfo = idToApp(appId)
               exec.state = state
    +          if (state == ExecutorState.RUNNING) { appInfo.resetRetryCount() }
               exec.application.driver ! ExecutorUpdated(execId, state, message, exitStatus)
               if (ExecutorState.isFinished(state)) {
    -            val appInfo = idToApp(appId)
                 // Remove this executor from the worker and app
    -            logInfo("Removing executor " + exec.fullId + " because it is " + state)
    +            logInfo(s"Removing executor ${exec.fullId} because it is $state")
                 appInfo.removeExecutor(exec)
                 exec.worker.removeExecutor(exec)
     
    -            val normalExit = exitStatus.exists(_ == 0)
    +            val normalExit = exitStatus == Some(0)
                 // Only retry certain number of times so we don't go into an infinite loop.
    -            if (!normalExit && appInfo.incrementRetryCount < ApplicationState.MAX_NUM_RETRY) {
    -              schedule()
    -            } else if (!normalExit) {
    -              logError("Application %s with ID %s failed %d times, removing it".format(
    -                appInfo.desc.name, appInfo.id, appInfo.retryCount))
    -              removeApplication(appInfo, ApplicationState.FAILED)
    +            if (!normalExit) {
    +              if (appInfo.incrementRetryCount() < ApplicationState.MAX_NUM_RETRY) {
    +                schedule()
    +              } else {
    +                val execs = idToApp(appId).executors.values
    --- End diff --
    
    You can just use the `appInfo` you declared up there




[GitHub] spark pull request: SPARK-2425 Don't kill a still-running Applicat...

Posted by markhamstra <gi...@git.apache.org>.
Github user markhamstra commented on the pull request:

    https://github.com/apache/spark/pull/1360#issuecomment-53163415
  
    I'm not sure I'm following, @mridulm.  The problem is not one of removing Executors, but rather of removing Applications that could and should still be left running even though some (but not all) Executors assigned to an Application are dying.




[GitHub] spark pull request: SPARK-2425 Don't kill a still-running Applicat...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1360#issuecomment-54914053
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20006/consoleFull) for   PR 1360 at commit [`f099c0b`](https://github.com/apache/spark/commit/f099c0b2654759ab5cbfe2bc91cedac10f3bf77f).
     * This patch **passes** unit tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: SPARK-2425 Don't kill a still-running Applicat...

Posted by markhamstra <gi...@git.apache.org>.
Github user markhamstra commented on the pull request:

    https://github.com/apache/spark/pull/1360#issuecomment-48643262
  
    @aarondav



[GitHub] spark pull request: SPARK-2425 Don't kill a still-running Applicat...

Posted by mridulm <gi...@git.apache.org>.
Github user mridulm commented on the pull request:

    https://github.com/apache/spark/pull/1360#issuecomment-53101677
  
    Will blacklisting executors with an appropriate timeout value not solve this? (If the
    executor is still usable, but some tasks fail on that specific executor for
    whatever reason.)
    
    The more general problem of detecting stragglers and/or "misbehaving"
    executors is something we have yet to address in Spark.




[GitHub] spark pull request: SPARK-2425 Don't kill a still-running Applicat...

Posted by andrewor14 <gi...@git.apache.org>.
Github user andrewor14 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1360#discussion_r17257157
  
    --- Diff: core/src/main/scala/org/apache/spark/deploy/master/Master.scala ---
    @@ -295,28 +295,34 @@ private[spark] class Master(
           val execOption = idToApp.get(appId).flatMap(app => app.executors.get(execId))
           execOption match {
             case Some(exec) => {
    +          val appInfo = idToApp(appId)
    --- End diff --
    
    Ah you're right




[GitHub] spark pull request: SPARK-2425 Don't kill a still-running Applicat...

Posted by andrewor14 <gi...@git.apache.org>.
Github user andrewor14 commented on the pull request:

    https://github.com/apache/spark/pull/1360#issuecomment-54908417
  
    retest this please




[GitHub] spark pull request: SPARK-2425 Don't kill a still-running Applicat...

Posted by markhamstra <gi...@git.apache.org>.
Github user markhamstra commented on the pull request:

    https://github.com/apache/spark/pull/1360#issuecomment-53164183
  
    @mridulm Is this blacklisting behavior a customization that you have made to Spark?  If not, could you point me to where and how it is implemented?
    
    What you are describing seems to be orthogonal and probably complementary to this PR: Yours, a means to prevent rescheduling of a task on an Executor where it cannot run successfully vs. this one, a means to prevent the killing of a running Application when some Executors die but others are still running the Application successfully.  Sounds to me like we want both of those means.




[GitHub] spark pull request: SPARK-2425 Don't kill a still-running Applicat...

Posted by markhamstra <gi...@git.apache.org>.
Github user markhamstra commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1360#discussion_r17255027
  
    --- Diff: core/src/main/scala/org/apache/spark/deploy/worker/Worker.scala ---
    @@ -234,7 +234,7 @@ private[spark] class Worker(
             try {
               logInfo("Asked to launch executor %s/%d for %s".format(appId, execId, appDesc.name))
               val manager = new ExecutorRunner(appId, execId, appDesc, cores_, memory_,
    -            self, workerId, host, sparkHome, workDir, akkaUrl, conf, ExecutorState.RUNNING)
    +            self, workerId, host, sparkHome, workDir, akkaUrl, conf, ExecutorState.LOADING)
    --- End diff --
    
    It's part of the fix.  ExecutorRunners now start in state LOADING and don't become RUNNING until their process is running and stderr and stdout are successfully opened -- see the changes in this PR to ExecutorRunner.scala.  If ExecutorRunners were to continue to be created in state RUNNING, then the new check for RUNNING Executors in Master.scala would always prevent Applications from being killed, even though useful Executors doing actual work may not exist.
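
    A condensed sketch of that sequencing, assuming the usual ExecutorRunner fields (`worker`, `appId`, `execId`) are in scope; this is an illustration of the state transition, not the literal patch:

        var state = ExecutorState.LOADING   // the runner is created in LOADING
        // ... fork the executor process and open its stdout/stderr streams ...
        state = ExecutorState.RUNNING       // only now does it count as RUNNING
        worker ! ExecutorStateChanged(appId, execId, state, None, None)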




[GitHub] spark pull request: SPARK-2425 Don't kill a still-running Applicat...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/1360#issuecomment-48643589
  
    Merged build started. 



[GitHub] spark pull request: SPARK-2425 Don't kill a still-running Applicat...

Posted by andrewor14 <gi...@git.apache.org>.
Github user andrewor14 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1360#discussion_r17250489
  
    --- Diff: core/src/main/scala/org/apache/spark/deploy/master/Master.scala ---
    @@ -295,28 +295,34 @@ private[spark] class Master(
           val execOption = idToApp.get(appId).flatMap(app => app.executors.get(execId))
           execOption match {
             case Some(exec) => {
    +          val appInfo = idToApp(appId)
               exec.state = state
    +          if (state == ExecutorState.RUNNING) { appInfo.resetRetryCount() }
               exec.application.driver ! ExecutorUpdated(execId, state, message, exitStatus)
               if (ExecutorState.isFinished(state)) {
    -            val appInfo = idToApp(appId)
                 // Remove this executor from the worker and app
    -            logInfo("Removing executor " + exec.fullId + " because it is " + state)
    +            logInfo(s"Removing executor ${exec.fullId} because it is $state")
                 appInfo.removeExecutor(exec)
                 exec.worker.removeExecutor(exec)
     
    -            val normalExit = exitStatus.exists(_ == 0)
    +            val normalExit = exitStatus == Some(0)
    --- End diff --
    
    Any reason to change this? The old code seems pretty clear.




[GitHub] spark pull request: SPARK-2425 Don't kill a still-running Applicat...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/1360




[GitHub] spark pull request: SPARK-2425 Don't kill a still-running Applicat...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1360#issuecomment-48773641
  
    QA results for PR 1360:
    - This patch FAILED unit tests.
    - This patch merges cleanly.
    - This patch adds no public classes.
    
    For more information see test output:
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16571/consoleFull

