Posted to issues@spark.apache.org by "Stephan Kepser (JIRA)" <ji...@apache.org> on 2016/10/28 10:22:00 UTC

[jira] [Issue Comment Deleted] (SPARK-18159) Stand-alone cluster, supervised app: restart of worker hosting the driver causes app to run twice

     [ https://issues.apache.org/jira/browse/SPARK-18159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Stephan Kepser updated SPARK-18159:
-----------------------------------
    Comment: was deleted

(was: I saw the old executors kept running for several hours (more than 5h).
We have a stand-alone Spark cluster without YARN or Mesos, so using YARN to kill the old executors is unfortunately not an option. Killing the old executors via the REST API also failed: they are immediately restarted. )

> Stand-alone cluster, supervised app: restart of worker hosting the driver causes app to run twice
> -------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-18159
>                 URL: https://issues.apache.org/jira/browse/SPARK-18159
>             Project: Spark
>          Issue Type: Bug
>    Affects Versions: 1.6.2
>            Reporter: Stephan Kepser
>            Priority: Critical
>
> We use Spark in stand-alone cluster mode with HA with three master nodes. All apps are submitted using
> > spark-submit --deploy-mode cluster --supervise --master ...
> We have many apps running.
> The deploy-mode cluster is needed to prevent the drivers of the apps from all being placed on the active master.
> If a worker goes down that hosts a driver, the following happens:
> * the driver is started on another worker node
> * the new driver does not connect to the still running app
> * the new driver starts a new instance of the running app
> * there are now two instances of the app running, 
>   * one with an attached new driver,
>   * one without a driver.
> * the old instance of the app cannot effectively be stopped, i.e., it can be killed via the UI, but is immediately restarted.
> Iterating this process causes more and more instances of the app running.
> Both options --deploy-mode cluster and --supervise are required to trigger the effect.
> The only remedy we know of is to reboot all Linux nodes the cluster runs on.
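For reference, a minimal sketch of the submission pattern described in the report; the master URLs, class name, and jar path are placeholders, not values taken from the issue:

```shell
# Hypothetical submission matching the reported setup: HA standalone
# cluster (three masters), driver placed on a worker (cluster deploy
# mode), and driver supervision enabled. The spark-submit flag is
# spelled --supervise.
spark-submit \
  --master spark://master1:7077,master2:7077,master3:7077 \
  --deploy-mode cluster \
  --supervise \
  --class com.example.MyApp \
  /path/to/my-app.jar
```

With --supervise, the master restarts the driver if it exits with a non-zero status; the report describes the restarted driver failing to reconnect to the still-running executors of the old instance.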



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org