Posted to issues@spark.apache.org by "Avik Sil (JIRA)" <ji...@apache.org> on 2016/07/22 09:11:21 UTC

[jira] [Comment Edited] (SPARK-15544) Bouncing Zookeeper node causes Active spark master to exit

    [ https://issues.apache.org/jira/browse/SPARK-15544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15389189#comment-15389189 ] 

Avik Sil edited comment on SPARK-15544 at 7/22/16 9:11 AM:
-----------------------------------------------------------

I am also seeing the same issue with Spark 1.3.0, Ubuntu 14.04, and ZooKeeper 3.4.5.

We have a 3-node cluster running Spark and ZooKeeper. We also have an automatic restarter service (sketched below) that checks the status of the Spark master every 5 minutes and restarts it if it is not running. So when the master shuts down after its leadership is revoked, the restarter service brings the Spark master back up within 5 minutes. But in *a few cases* we don't see an ALIVE Spark master on any of the three nodes - there is no "We have gained leadership" message on any of them.

So in a sense the workaround suggested in the above comment does not work.
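
For context, the restarter amounts to a cron-driven check along the following lines (a rough sketch only; the script path, the SPARK_HOME location, and the use of pgrep are assumptions, not our exact service):

{code:title=restarter sketch}
#!/bin/sh
# Run from cron every 5 minutes, e.g.:
#   */5 * * * * /usr/local/bin/check-spark-master.sh
#
# If no standalone Master JVM is running on this node, start one with
# the stock sbin script. SPARK_HOME below is an assumed install path.
SPARK_HOME=${SPARK_HOME:-/opt/spark}

if ! pgrep -f org.apache.spark.deploy.master.Master > /dev/null; then
  "$SPARK_HOME/sbin/start-master.sh"
fi
{code}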

From spark-defaults.conf:

{code}
spark.deploy.recoveryMode ZOOKEEPER
spark.deploy.zookeeper.url 192.168.42.2:28000,192.168.42.3:28000,192.168.42.4:28000
spark.deploy.recoveryDirectory /var/run/sparkmaster/df71911f-a28d-409d-977f-ea2e596ec578/recovery
spark.akka.logLifecycleEvents true
{code}
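
For reference, the ALIVE/STANDBY state mentioned above can be read from each Master's web UI JSON endpoint, along these lines (the 8080 web UI port is the default and an assumption here; the hosts are the ones from the config above):

{code:title=checking master state (sketch)}
#!/bin/sh
# Query each node's standalone Master web UI; the returned JSON contains
# a "status" field reporting ALIVE, STANDBY, or RECOVERING.
for host in 192.168.42.2 192.168.42.3 192.168.42.4; do
  printf '%s: ' "$host"
  curl -s "http://$host:8080/json" | grep -o '"status" *: *"[A-Z_]*"'
done
{code}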




> Bouncing Zookeeper node causes Active spark master to exit
> ----------------------------------------------------------
>
>                 Key: SPARK-15544
>                 URL: https://issues.apache.org/jira/browse/SPARK-15544
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.6.1
>         Environment: Ubuntu 14.04.  Zookeeper 3.4.6 with 3-node quorum
>            Reporter: Steven Lowenthal
>
> Shutting down a single ZooKeeper node caused the active Spark master to exit. The master should have connected to a second ZooKeeper node.
> {code:title=log output}
> 16/05/25 18:21:28 INFO master.Master: Launching executor app-20160525182128-0006/1 on worker worker-20160524013212-10.16.28.76-59138
> 16/05/25 18:21:28 INFO master.Master: Launching executor app-20160525182128-0006/2 on worker worker-20160524013204-10.16.21.217-47129
> 16/05/26 00:16:01 INFO zookeeper.ClientCnxn: Unable to read additional data from server sessionid 0x154dfc0426b0054, likely server has closed socket, closing socket connection and attempting reconnect
> 16/05/26 00:16:01 INFO zookeeper.ClientCnxn: Unable to read additional data from server sessionid 0x254c701f28d0053, likely server has closed socket, closing socket connection and attempting reconnect
> 16/05/26 00:16:01 INFO state.ConnectionStateManager: State change: SUSPENDED
> 16/05/26 00:16:01 INFO state.ConnectionStateManager: State change: SUSPENDED
> 16/05/26 00:16:01 INFO master.ZooKeeperLeaderElectionAgent: We have lost leadership
> 16/05/26 00:16:01 ERROR master.Master: Leadership has been revoked -- master shutting down.
> {code}
> spark-env.sh: 
> {code:title=spark-env.sh}
> export SPARK_LOCAL_DIRS=/ephemeral/spark/local
> export SPARK_WORKER_DIR=/ephemeral/spark/work
> export SPARK_LOG_DIR=/var/log/spark
> export HADOOP_CONF_DIR=/home/ubuntu/hadoop-2.6.3/etc/hadoop
> export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER -Dspark.deploy.zookeeper.url=gn5456-zookeeper-01:2181,gn5456-zookeeper-02:2181,gn5456-zookeeper-03:2181"
> export SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true"
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org