Posted to issues@spark.apache.org by "Saloni (Jira)" <ji...@apache.org> on 2020/12/30 12:03:00 UTC

[jira] [Updated] (SPARK-33943) Zookeeper LeaderElection Agent not being called by Spark Master

     [ https://issues.apache.org/jira/browse/SPARK-33943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Saloni updated SPARK-33943:
---------------------------
    Description: 
I have 2 Spark Masters and 3 ZooKeepers deployed on separate virtual machines. The services come online in the following sequence:
 # zookeeper-1
 # sparkmaster-1
 # sparkmaster-2
 # zookeeper-2
 # zookeeper-3

The above sequence leaves both Spark Masters running in STANDBY mode.

From the logs, I can see that the Spark Master manages to create a ZooKeeper session only after a second ZooKeeper service comes up (i.e. a quorum of 2 of the 3 ZooKeepers is available); until then, it keeps retrying session creation. However, even after the quorum is up and the Persistence Engine successfully connects and creates a session, *the ZooKeeper LeaderElection Agent is not called*. (A sketch of how that agent is wired up follows.)
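
For context, the Spark Master's ZooKeeper recovery mode delegates leader election to Apache Curator's LeaderLatch recipe; Spark's ZooKeeperLeaderElectionAgent registers itself as the latch listener. Below is a minimal, hedged sketch of that wiring in plain Curator. The connect string and the /spark/leader_election path are illustrative placeholders, not Spark's actual configuration:
{code:java}
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.leader.LeaderLatch;
import org.apache.curator.framework.recipes.leader.LeaderLatchListener;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class LeaderElectionSketch {
    public static void main(String[] args) throws Exception {
        // Illustrative connect string; Spark takes this from spark.deploy.zookeeper.url.
        CuratorFramework client = CuratorFrameworkFactory.newClient(
                "zookeeper-1:2181,zookeeper-2:2181,zookeeper-3:2181",
                new ExponentialBackoffRetry(1000, 3));
        client.start();

        // The election path is a placeholder; Spark derives its own from
        // spark.deploy.zookeeper.dir. The listener is where a Master would
        // flip between ALIVE (leader) and STANDBY (not leader).
        LeaderLatch latch = new LeaderLatch(client, "/spark/leader_election");
        latch.addListener(new LeaderLatchListener() {
            @Override
            public void isLeader() {
                System.out.println("isLeader(): would transition this Master to ALIVE");
            }

            @Override
            public void notLeader() {
                System.out.println("notLeader(): would transition this Master to STANDBY");
            }
        });
        latch.start(); // callbacks fire only once the latch joins the election

        Thread.currentThread().join(); // keep the process alive for the demo
    }
}
{code}
The relevance to this issue: if the code path that creates and starts the latch is never reached (for example, because startup failed earlier while the ZooKeeper quorum was still forming), neither callback ever fires and both Masters stay in STANDBY, which matches the behaviour described above.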

Logs (spark-master.log):
{code:java}
10:03:47.241 INFO org.apache.spark.internal.Logging:57 - Persisting recovery state to ZooKeeper
Initiating client connection, connectString=zookeeper-2:xxxx,zookeeper-3:xxxx,zookeeper-1:xxxx sessionTimeout=60000 watcher=org.apache.curator.ConnectionState

##### Only zookeeper-2 is online #####

10:03:47.630 INFO org.apache.zookeeper.ClientCnxn$SendThread:1025 - Opening socket connection to server zookeeper-1:xxxx. Will not attempt to authenticate using SASL (unknown error)
10:03:50.635 INFO org.apache.zookeeper.ClientCnxn$SendThread:1162 - Socket error occurred: zookeeper-1:xxxx: No route to host
10:03:50.738 INFO org.apache.zookeeper.ClientCnxn$SendThread:1025 - Opening socket connection to server zookeeper-2:xxxx. Will not attempt to authenticate using SASL (unknown error)
10:03:50.739 INFO org.apache.zookeeper.ClientCnxn$SendThread:879 - Socket connection established to zookeeper-2:xxxx, initiating session
10:03:50.742 INFO org.apache.zookeeper.ClientCnxn$SendThread:1158 - Unable to read additional data from server sessionid 0x0, likely server has closed socket, closing socket connection and attempting reconnect
10:03:51.842 INFO org.apache.zookeeper.ClientCnxn$SendThread:1025 - Opening socket connection to server zookeeper-3:xxxx. Will not attempt to authenticate using SASL (unknown error)
10:03:51.843 INFO org.apache.zookeeper.ClientCnxn$SendThread:1162 - Socket error occurred: zookeeper-3:xxxx: Connection refused 
10:04:02.685 ERROR org.apache.curator.ConnectionState:200 - Connection timed out for connection string (zookeeper-2:xxxx,zookeeper-3:xxxx,zookeeper-1:xxxx) and timeout (15000) / elapsed (15274)
org.apache.curator.CuratorConnectionLossException: KeeperErrorCode = ConnectionLoss 
  at org.apache.curator.ConnectionState.checkTimeouts(ConnectionState.java:197) 
  at org.apache.curator.ConnectionState.getZooKeeper(ConnectionState.java:87)
...
...
...
10:04:22.691 ERROR org.apache.curator.ConnectionState:200 - Connection timed out for connection string (zookeeper-2:xxxx,zookeeper-3:xxxx,zookeeper-1:xxxx) and timeout (15000) / elapsed (35297)
org.apache.curator.CuratorConnectionLossException: KeeperErrorCode = ConnectionLoss 
  at org.apache.curator.ConnectionState.checkTimeouts(ConnectionState.java:197)
...
...
...
10:04:42.696 ERROR org.apache.curator.ConnectionState:200 - Connection timed out for connection string (zookeeper-2:xxxx,zookeeper-3:xxxx,zookeeper-1:xxxx) and timeout (15000) / elapsed (55301)
org.apache.curator.CuratorConnectionLossException: KeeperErrorCode = ConnectionLoss 
  at org.apache.curator.ConnectionState.checkTimeouts(ConnectionState.java:197) 
  at org.apache.curator.ConnectionState.getZooKeeper(ConnectionState.java:87)
...
...
...
10:05:32.699 WARN org.apache.curator.ConnectionState:191 - Connection attempt unsuccessful after 105305 (greater than max timeout of 60000). Resetting connection and trying again with a new connection. 
10:05:32.864 INFO org.apache.zookeeper.ZooKeeper:693 - Session: 0x0 closed
10:05:32.865 INFO org.apache.zookeeper.ZooKeeper:442 - Initiating client connection, connectString=zookeeper-2:xxxx,zookeeper-3:xxxx,zookeeper-1:xxxx sessionTimeout=60000 watcher=org.apache.curator.ConnectionState@ 
10:05:32.864 INFO org.apache.zookeeper.ClientCnxn$EventThread:522 - EventThread shut down for session: 0x0 
10:05:32.969 ERROR org.apache.spark.internal.Logging:94 - Ignoring error
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /x/y 
  at org.apache.zookeeper.KeeperException.create(KeeperException.java:102)
  at org.apache.zookeeper.KeeperException.create(KeeperException.java:54) 

##### zookeeper-2, zookeeper-3 are online ##### 

10:05:47.357 INFO org.apache.zookeeper.ClientCnxn$SendThread:1025 - Opening socket connection to server zookeeper-2:xxxx. Will not attempt to authenticate using SASL (unknown error) 
10:05:47.358 INFO org.apache.zookeeper.ClientCnxn$SendThread:879 - Socket connection established to zookeeper-2:xxxx, initiating session 
10:05:47.359 INFO org.apache.zookeeper.ClientCnxn$SendThread:1158 - Unable to read additional data from server sessionid 0x0, likely server has closed socket, closing socket connection and attempting reconnect 
10:05:47.528 INFO org.apache.zookeeper.ClientCnxn$SendThread:1025 - Opening socket connection to server zookeeper-1:xxxx. Will not attempt to authenticate using SASL (unknown error)
10:05:50.529 INFO org.apache.zookeeper.ClientCnxn$SendThread:1162 - Socket error occurred: zookeeper-1:xxxx: No route to host 
10:05:51.454 INFO org.apache.zookeeper.ClientCnxn$SendThread:1025 - Opening socket connection to server zookeeper-3:xxxx. Will not attempt to authenticate using SASL (unknown error) 
10:05:51.455 INFO org.apache.zookeeper.ClientCnxn$SendThread:879 - Socket connection established to zookeeper-3:xxxx, initiating session 
10:05:51.457 INFO org.apache.zookeeper.ClientCnxn$SendThread:1158 - Unable to read additional data from server sessionid 0x0, likely server has closed socket, closing socket connection and attempting reconnect 
10:05:57.564 INFO org.apache.zookeeper.ClientCnxn$SendThread:1025 - Opening socket connection to server zookeeper-3:xxxx. Will not attempt to authenticate using SASL (unknown error) 
10:05:57.566 INFO org.apache.zookeeper.ClientCnxn$SendThread:879 - Socket connection established to zookeeper-3:xxxx, initiating session 
10:05:57.574 INFO org.apache.zookeeper.ClientCnxn$SendThread:1299 - Session establishment complete on server zookeeper-3:xxxx, sessionid = xxxx, negotiated timeout = 40000 
10:05:57.580 INFO org.apache.curator.framework.state.ConnectionStateManager:228 - State change: CONNECTED {code}
Steps to reproduce:

Environment: a cluster of 3 ZooKeeper VMs and a cluster of 2 Spark Master VMs.
 # All ZooKeepers and Spark Masters are offline.
 # Bring zookeeper-2 online.
 # Bring both Spark Masters online.
 # About 3 minutes after zookeeper-2 comes online, bring zookeeper-3 online.
 # Bring zookeeper-1 online.

Questions:
 # The last line of the logs above shows that a ZooKeeper session was eventually established. Why is the ZooKeeper LeaderElection Agent still not being called?
 # Is there any Spark configuration to increase the number of retries or the timeouts used when connecting to ZooKeeper? (See the sketch below.)
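
Regarding question 2: Spark documents spark.deploy.zookeeper.url and spark.deploy.zookeeper.dir for ZooKeeper recovery, but the connection/session timeouts and retry policy appear to be fixed constants rather than configuration; the log above shows "timeout (15000)" and "max timeout of 60000", i.e. a 15 s connection timeout and a 60 s session timeout. The sketch below is a hedged reconstruction of how such a Curator client would be created; the constant values are inferred from the log output, and the retry policy is a hypothetical placeholder:
{code:java}
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.retry.RetryNTimes;

public class CuratorClientSketch {
    // Values inferred from the log output above, not read from the Spark source:
    // "sessionTimeout=60000" and "timeout (15000)" suggest these two constants.
    private static final int ZK_SESSION_TIMEOUT_MS = 60000;
    private static final int ZK_CONNECTION_TIMEOUT_MS = 15000;

    public static CuratorFramework newClient(String zkUrl) {
        CuratorFramework client = CuratorFrameworkFactory.newClient(
                zkUrl,
                ZK_SESSION_TIMEOUT_MS,
                ZK_CONNECTION_TIMEOUT_MS,
                // Hypothetical retry policy for the sketch: 3 attempts, 5 s apart.
                new RetryNTimes(3, 5000));
        client.start();
        return client;
    }
}
{code}
If these values really are compile-time constants (SparkCuratorUtil would be the place to check), raising them would require patching Spark rather than setting a property.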

> Zookeeper LeaderElection Agent not being called by Spark Master
> ---------------------------------------------------------------
>
>                 Key: SPARK-33943
>                 URL: https://issues.apache.org/jira/browse/SPARK-33943
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 3.0.0
>         Environment: 2 Spark Masters KVMs and 3 Zookeeper KVMs.
> Operating System - RHEL 6.6
>            Reporter: Saloni
>            Priority: Major
>



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
