Posted to issues@spark.apache.org by "David Kats (JIRA)" <ji...@apache.org> on 2017/07/20 15:06:02 UTC

[jira] [Commented] (SPARK-15544) Bouncing Zookeeper node causes Active spark master to exit

    [ https://issues.apache.org/jira/browse/SPARK-15544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16094797#comment-16094797 ] 

David Kats commented on SPARK-15544:
------------------------------------

Confirming the same issue with Spark 2.1.0 and 2.2.0 on Ubuntu 14.04 with ZooKeeper 3.4.5:

2017-07-20 12:48:25,151 INFO ClientCnxn: Client session timed out, have not heard from server in 35022ms for sessionid 0x15d5fb6dc7d0009, closing socket connection and attempting reconnect
2017-07-20 12:48:25,254 INFO ConnectionStateManager: State change: SUSPENDED
2017-07-20 12:48:25,268 INFO ZooKeeperLeaderElectionAgent: We have lost leadership
2017-07-20 12:48:25,295 ERROR Master: Leadership has been revoked -- master shutting down.
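
To make the timing explicit, here is a minimal sketch (plain Python, not part of Spark) that parses the four log lines above and prints each event's offset from the session timeout. The first timestamp is read as 2017-07-20, and the `parse` helper is a hypothetical name introduced only for this illustration.

```python
from datetime import datetime

# Log lines copied from the comment above; the first timestamp is read as 2017-07-20.
LOG = """\
2017-07-20 12:48:25,151 INFO ClientCnxn: Client session timed out, have not heard from server in 35022ms for sessionid 0x15d5fb6dc7d0009, closing socket connection and attempting reconnect
2017-07-20 12:48:25,254 INFO ConnectionStateManager: State change: SUSPENDED
2017-07-20 12:48:25,268 INFO ZooKeeperLeaderElectionAgent: We have lost leadership
2017-07-20 12:48:25,295 ERROR Master: Leadership has been revoked -- master shutting down.
"""


def parse(line):
    """Split one log line into (timestamp, level, component, message)."""
    date, time, level, rest = line.split(" ", 3)
    component, message = rest.split(": ", 1)
    stamp = datetime.strptime(f"{date} {time}", "%Y-%m-%d %H:%M:%S,%f")
    return stamp, level, component, message


events = [parse(line) for line in LOG.splitlines()]
first = events[0][0]

# Milliseconds elapsed since the "session timed out" line, one entry per event.
offsets_ms = [round((stamp - first).total_seconds() * 1000) for stamp, *_ in events]

for offset, (_, level, component, _) in zip(offsets_ms, events):
    print(f"+{offset:>3} ms  {level:<5} {component}")
```

By these timestamps the master logs its shutdown 144 ms after the session timeout and 41 ms after the SUSPENDED state change, so the log shows no sign of a reconnect to another ZooKeeper node completing before leadership is revoked.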


> Bouncing Zookeeper node causes Active spark master to exit
> ----------------------------------------------------------
>
>                 Key: SPARK-15544
>                 URL: https://issues.apache.org/jira/browse/SPARK-15544
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.6.1
>         Environment: Ubuntu 14.04.  Zookeeper 3.4.6 with 3-node quorum
>            Reporter: Steven Lowenthal
>
> Shutting down a single ZooKeeper node caused the Spark master to exit. The master should have connected to a second ZooKeeper node.
> {code:title=log output}
> 16/05/25 18:21:28 INFO master.Master: Launching executor app-20160525182128-0006/1 on worker worker-20160524013212-10.16.28.76-59138
> 16/05/25 18:21:28 INFO master.Master: Launching executor app-20160525182128-0006/2 on worker worker-20160524013204-10.16.21.217-47129
> 16/05/26 00:16:01 INFO zookeeper.ClientCnxn: Unable to read additional data from server sessionid 0x154dfc0426b0054, likely server has closed socket, closing socket connection and attempting reconnect
> 16/05/26 00:16:01 INFO zookeeper.ClientCnxn: Unable to read additional data from server sessionid 0x254c701f28d0053, likely server has closed socket, closing socket connection and attempting reconnect
> 16/05/26 00:16:01 INFO state.ConnectionStateManager: State change: SUSPENDED
> 16/05/26 00:16:01 INFO state.ConnectionStateManager: State change: SUSPENDED
> 16/05/26 00:16:01 INFO master.ZooKeeperLeaderElectionAgent: We have lost leadership
> 16/05/26 00:16:01 ERROR master.Master: Leadership has been revoked -- master shutting down.
> {code}
> spark-env.sh: 
> {code:title=spark-env.sh}
> export SPARK_LOCAL_DIRS=/ephemeral/spark/local
> export SPARK_WORKER_DIR=/ephemeral/spark/work
> export SPARK_LOG_DIR=/var/log/spark
> export HADOOP_CONF_DIR=/home/ubuntu/hadoop-2.6.3/etc/hadoop
> export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER -Dspark.deploy.zookeeper.url=gn5456-zookeeper-01:2181,gn5456-zookeeper-02:2181,gn5456-zookeeper-03:2181"
> export SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true"
> {code}


