You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by markhamstra <gi...@git.apache.org> on 2014/12/02 08:41:09 UTC

[GitHub] spark pull request: [SPARK-4498][core] Don't transition ExecutorIn...

GitHub user markhamstra opened a pull request:

    https://github.com/apache/spark/pull/3550

    [SPARK-4498][core] Don't transition ExecutorInfo to RUNNING until Driver adds Executor

    The ExecutorInfo only reaches the RUNNING state if the Driver is alive to send the ExecutorStateChanged message to master.  Else, appInfo.resetRetryCount() is never called and failing Executors will eventually exceed ApplicationState.MAX_NUM_RETRY, resulting in the application being removed from the master's accounting.
    
    @JoshRosen 

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/markhamstra/spark SPARK-4498

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/3550.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #3550
    
----
commit 8f543b10222bf5006326d3fe0605d54474175501
Author: Mark Hamstra <ma...@gmail.com>
Date:   2014-12-02T06:54:07Z

    Don't transition ExecutorInfo to RUNNING until Executor is added by Driver

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4498][core] Don't transition ExecutorIn...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/3550#issuecomment-65200648
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24033/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4498][core] Don't transition ExecutorIn...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/3550#issuecomment-65193667
  
      [Test build #24033 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24033/consoleFull) for   PR 3550 at commit [`8f543b1`](https://github.com/apache/spark/commit/8f543b10222bf5006326d3fe0605d54474175501).
     * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4498][core] Don't transition ExecutorIn...

Posted by andrewor14 <gi...@git.apache.org>.
Github user andrewor14 commented on the pull request:

    https://github.com/apache/spark/pull/3550#issuecomment-65264098
  
    Yeah, this seems safe to me. Even if the Master doesn't know that the driver has exited for some reason (i.e. if the `finishApplication` was somehow not triggered from the `DisassociatedEvent` [here](https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/master/Master.scala#L411)), this will still fail the application correctly because all existing executors will have exited and new executors will fail immediately because they can't connect to the driver. I agree that this is a better fix for 1.2. Eventually it would be good to get Josh's changes in too because it's easier to test things there.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4498][core] Don't transition ExecutorIn...

Posted by markhamstra <gi...@git.apache.org>.
Github user markhamstra commented on the pull request:

    https://github.com/apache/spark/pull/3550#issuecomment-65194915
  
    The application won't be killed if an executor has been recognized by master as RUNNING (https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/master/Master.scala#L328).  The buggy host will just keep trying and failing to launch executors.
    
    Detecting and blacklisting buggy hosts seems like a separable and complex issue.  It would also be a new feature that maybe we don't want to add to 1.2 at the last minute.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4498][core] Don't transition ExecutorIn...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/3550#issuecomment-65200639
  
      [Test build #24033 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24033/consoleFull) for   PR 3550 at commit [`8f543b1`](https://github.com/apache/spark/commit/8f543b10222bf5006326d3fe0605d54474175501).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4498][core] Don't transition ExecutorIn...

Posted by markhamstra <gi...@git.apache.org>.
Github user markhamstra commented on the pull request:

    https://github.com/apache/spark/pull/3550#issuecomment-65269257
  
    It's worth spending a little time checking that any executors that are RUNNING for an application definitely will transition to a Finished state and be removed from the master's accounting if the application dies.  If we are certain that all the running executors will finish after application death and that repeatedly failing executors from a bad node while a running executor remains on master's books will not progressively consume resources, then I think this PR solves the problems.  The only sort-of negative that I am seeing is that there can be an arbitrarily large number of failed executor launch attempts while at least one executor remains running, which will at least fill up error logs; but that is arguably not an all bad thing and is something whose proper resolution can be better handled (at least for now) by a system administrator than by an attempt to automate resolution. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4498][core] Don't transition ExecutorIn...

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on the pull request:

    https://github.com/apache/spark/pull/3550#issuecomment-65307121
  
    One idea for testing this: comment out the line in the `DisassociationEvent` handler that removes the application then check that a killed application is eventually removed via this mechanism.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4498][core] Don't transition ExecutorIn...

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on the pull request:

    https://github.com/apache/spark/pull/3550#issuecomment-65194277
  
    I considered something like this, but I think that this re-introduces cases where a single bad host can cause the entire application to fail.  Imagine that I have a cluster where all but one of the hosts are functioning correctly; I'll register executors on the good hosts once at the beginning of the app and can then experience an infinite number of executor launch failures on the buggy host since we don't have a blacklist.  So, we might have a case where the application is able to make progress with the executors that it has but is killed due to failed attempts to acquire more executors, since all of the resets/decrements to the "progress towards failure" counter will only occurred at the beginning of the app, while the increments occurred continuously.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org