Posted to issues@spark.apache.org by "Behroz Sikander (JIRA)" <ji...@apache.org> on 2018/07/12 09:22:00 UTC

[jira] [Updated] (SPARK-24794) DriverWrapper should have both master addresses in -Dspark.master

     [ https://issues.apache.org/jira/browse/SPARK-24794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Behroz Sikander updated SPARK-24794:
------------------------------------
    Description: 
In standalone cluster mode, one can launch a driver with supervise mode enabled. Spark launches the driver with the JVM argument -Dspark.master, which is set to the [host and port of the current master|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/rest/StandaloneRestServer.scala#L149].
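
For reference, the driver description built by the REST server sets spark.master roughly like the following (a paraphrased sketch of the linked StandaloneRestServer.buildDriverDescription, not the exact source):
{code}
// masterUrl is the host:port of the master that received the REST submission,
// so the supervised driver only ever learns about that single master.
val conf = new SparkConf(false)
  .setAll(sparkProperties)          // properties sent with the submission
  .set("spark.master", masterUrl)   // overwrites any submitted spark.master
{code}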

 

During the lifetime of the context, the Spark masters can fail over for any reason. If the driver then dies unexpectedly and is relaunched by supervise mode, it tries to connect to the master that was set initially through -Dspark.master, but that master is now in STANDBY mode. The context retries the connection to the standby master several times and then kills itself.
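
When spark.master lists more than one address, as it does for an HA submission, the client splits the URL and registers with every master, roughly as Utils.parseStandaloneMasterUrls does (paraphrased sketch):
{code}
// "spark://master1:7077,master2:7077" ->
//   Array("spark://master1:7077", "spark://master2:7077")
def parseStandaloneMasterUrls(masterUrls: String): Array[String] =
  masterUrls.stripPrefix("spark://").split(",").map("spark://" + _)
{code}
So if -Dspark.master carried both addresses, the relaunched driver could still reach whichever master is currently ACTIVE.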

 

*Suggestion:*

While launching the driver process, the Spark master should use the [spark.master passed as input|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/rest/StandaloneRestServer.scala#L124] instead of the host and port of the current master, so that a relaunched driver still knows about all masters of an HA setup; see the sketch below.
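
A minimal sketch of that change, assuming the same paraphrased construction as above (the getOrElse fallback is an illustration, not the current source):
{code}
// Keep the spark.master supplied with the submission, which in an HA deployment
// can list every master, and fall back to the currently active master only when
// none was supplied.
val conf = new SparkConf(false)
  .setAll(sparkProperties)
  .set("spark.master", sparkProperties.getOrElse("spark.master", masterUrl))
{code}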

Log messages that we observe:

 
{code:java}
2018-07-11 13:03:21,801 INFO appclient-register-master-threadpool-0 org.apache.spark.deploy.client.StandaloneAppClient$ClientEndpoint []: Connecting to master spark://10.100.100.22:7077..
.....
2018-07-11 13:03:21,806 INFO netty-rpc-connection-0 org.apache.spark.network.client.TransportClientFactory []: Successfully created connection to /10.100.100.22:7077 after 1 ms (0 ms spent in bootstraps)
.....
2018-07-11 13:03:41,802 INFO appclient-register-master-threadpool-0 org.apache.spark.deploy.client.StandaloneAppClient$ClientEndpoint []: Connecting to master spark://10.100.100.22:7077...
.....
2018-07-11 13:04:01,802 INFO appclient-register-master-threadpool-0 org.apache.spark.deploy.client.StandaloneAppClient$ClientEndpoint []: Connecting to master spark://10.100.100.22:7077...
.....
2018-07-11 13:04:21,806 ERROR appclient-registration-retry-thread org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend []: Application has been killed. Reason: All masters are unresponsive! Giving up.{code}

> DriverWrapper should have both master addresses in -Dspark.master
> -----------------------------------------------------------------
>
>                 Key: SPARK-24794
>                 URL: https://issues.apache.org/jira/browse/SPARK-24794
>             Project: Spark
>          Issue Type: Bug
>          Components: Deploy
>    Affects Versions: 2.2.1
>            Reporter: Behroz Sikander
>            Priority: Major
>



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org