You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@flink.apache.org by Yunus Olgun <yu...@gmail.com> on 2018/05/15 22:46:45 UTC

Flink recovers even cancelled jobs after zookeeper failure

Hi,
We are using Flink 1.4.0 at zookeeper high availability mode and with
externalized checkpoints. Today after we have restarted a zookeeper node,
several Flink clusters have lost connection to the zookeeper. This
triggered a leader election at effected clusters. After the leader
election, the new leader recovered running jobs. Up to this point,
everything is good.

Then the new leader also has restarted old canceled jobs. This seems to be
a bug and caused problems. A snippet from the logs is at the bottom of this
email.

Is this a known problem? If it helps, we usually cancel through Flink UI.
Also, any ideas on how to prevent such a thing from happening again are
appreciated.

Regards,
Yunus


2018-05-15 16:31:50,332 INFO
org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - Unable
to read additional data from server sessionid 0x15f58694b13bcda, likely
server has closed socket, closing socket connection and attempting reconnect
2018-05-15 16:31:50,434 INFO
org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager
- State change: SUSPENDED
2018-05-15 16:31:50,434 WARN
org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  -
Connection to ZooKeeper suspended. The contender akka.tcp://flink@ino10
:44855
1/user/jobmanager no longer participates in the leader election.
2018-05-15 16:31:50,434 WARN
org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  -
Connection to ZooKeeper suspended. Can no longer retrieve the leader from
ZooKeeper.
2018-05-15 16:31:50,434 WARN
org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  -
Connection to ZooKeeper suspended. Can no longer retrieve the leader from
ZooKeeper.
2018-05-15 16:31:50,434 WARN
org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore  -
ZooKeeper connection SUSPENDED. Changes to the submitted job graphs are not
monn
itored (temporarily).
2018-05-15 16:31:50,496 WARN
org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - SASL
configuration failed: javax.security.auth.login.LoginException: No JAAS
conff
iguration section named 'Client' was found in specified JAAS configuration
file: '/tmp/jaas-5118766953326230150.conf'. Will continue connection to
Zookeeper server without SASL autthentication, if Zookeeper server allows
it.
2018-05-15 16:31:50,496 INFO
org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  -
Opening socket connection to server nebula-0.xxxxx.lan/172.30.0.100:2181
2018-05-15 16:31:50,496 ERROR
org.apache.flink.shaded.curator.org.apache.curator.ConnectionState  -
Authentication failed
2018-05-15 16:31:50,496 INFO
org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - Socket
connection established to nebula-0.xxxxx.lan/172.30.0.100:2181, initiatt
ing session
2018-05-15 16:31:50,498 INFO
org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  -
Session establishment complete on server nebula-0.xxxxx.lan/
172.30.0.100:2181,,
 sessionid = 0x15f58694b13bcda, negotiated timeout = 60000
onfiguration section named 'Client' was found in specified JAAS
configuration file: '/tmp/jaas-5118766953326230150.conf'. Will continue
connection to Zookeeper server without SASL authentication, if Zookeeper
server allows it.
2018-05-15 16:31:50,496 INFO
org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  -
Opening socket connection to server nebula-0.xxxxx.lan/172.30.0.100:2181
2018-05-15 16:31:50,496 ERROR
org.apache.flink.shaded.curator.org.apache.curator.ConnectionState  -
Authentication failed
2018-05-15 16:31:50,496 INFO
org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - Socket
connection established to nebula-0.xxxxx.lan/172.30.0.100:2181, initiating
session
2018-05-15 16:31:50,498 INFO
org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  -
Session establishment complete on server nebula-0.xxxxx.lan/
172.30.0.100:2181, sessionid = 0x15f58694b13bcda, negotiated timeout = 60000
2018-05-15 16:31:50,498 INFO
org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager
- State change: RECONNECTED
2018-05-15 16:31:50,498 INFO
org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  -
Connection to ZooKeeper was reconnected. Leader election can be restarted.
2018-05-15 16:31:50,499 INFO
org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  -
Connection to ZooKeeper was reconnected. Leader retrieval can be restarted.
2018-05-15 16:31:50,499 INFO
org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  -
Connection to ZooKeeper was reconnected. Leader retrieval can be restarted.
2018-05-15 16:31:50,499 INFO
org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore  -
ZooKeeper connection RECONNECTED. Changes to the submitted job graphs are
monitored again.
2018-05-15 16:31:51,364 INFO
org.apache.flink.runtime.jobmanager.JobManager                - JobManager
akka.tcp://flink@ino10:44851/user/jobmanager was granted leadership with
leader session ID Some(0c6e1947-c2a6-45cf-b4c6-e006357affb6).
2018-05-15 16:31:51,372 INFO
org.apache.flink.runtime.jobmanager.JobManager                - Delaying
recovery of all jobs by 10000 milliseconds.
2018-05-15 16:31:51,375 INFO
org.apache.flink.runtime.clusterframework.standalone.StandaloneResourceManager
- Trying to associate with JobManager leader akka.tcp://flink@ino10
:44851/user/jobmanager
2018-05-15 16:31:51,375 INFO
org.apache.flink.runtime.clusterframework.standalone.StandaloneResourceManager
- Resource Manager associating with leading JobManager
Actor[akka://flink/user/jobmanager#692352189] - leader session
0c6e1947-c2a6-45cf-b4c6-e006357affb6
2018-05-15 16:31:51,399 INFO
org.apache.flink.runtime.clusterframework.standalone.StandaloneResourceManager
- TaskManager 09d5caf33316cec17d6f2d089ed5bd94 has started.
2018-05-15 16:31:51,399 INFO
org.apache.flink.runtime.instance.InstanceManager             - Registered
TaskManager at ino11 (akka.tcp://flink@ino11:42603/user/taskmanager) as
dfe46e4854ee1e2f5072783d61cfbb81. Current number of registered hosts is 1.
Current number of alive task slots is 8.
2018-05-15 16:31:51,401 INFO
org.apache.flink.runtime.instance.InstanceManager             - Registered
TaskManager at ino11 (akka.tcp://flink@ino11:45607/user/taskmanager) as
ca2881ee6ab11f9b5ffb7acb249494f5. Current number of registered hosts is 2.
Current number of alive task slots is 16.
2018-05-15 16:31:51,402 INFO
org.apache.flink.runtime.clusterframework.standalone.StandaloneResourceManager
- TaskManager 659642a31157802db9de994f7d6669e5 has started.
2018-05-15 16:31:51,413 INFO
org.apache.flink.runtime.clusterframework.standalone.StandaloneResourceManager
- TaskManager 515a0db8bd16ed1dfc23332cebafe92f has started.
2018-05-15 16:31:51,413 INFO
org.apache.flink.runtime.instance.InstanceManager             - Registered
TaskManager at ino10 (akka.tcp://flink@ino10:33003/user/taskmanager) as
30d3491cf7edf3c48bc9498558052e63. Current number of registered hosts is 3.
Current number of alive task slots is 24.
2018-05-15 16:31:51,430 INFO
org.apache.flink.runtime.clusterframework.standalone.StandaloneResourceManager
- TaskManager b938f5ba1e40e17f60bc64e9bc39d79a has started.
2018-05-15 16:31:51,430 INFO
org.apache.flink.runtime.instance.InstanceManager             - Registered
TaskManager at ino10 (akka.tcp://flink@ino10:40339/user/taskmanager) as
0b01240b5169f94d2b870a3fd25f41df. Current number of registered hosts is 4.
Current number of alive task slots is 32.
2018-05-15 16:32:01,389 INFO
org.apache.flink.runtime.jobmanager.JobManager                - Attempting
to recover all jobs.

2018-05-15 16:32:01,391 INFO
org.apache.flink.runtime.jobmanager.JobManager                - There are 7
jobs to recover. Starting the job recovery.
2018-05-15 16:32:01,391 INFO
org.apache.flink.runtime.jobmanager.JobManager                - Attempting
to recover job cd8ba3012f3eefcae4dea2f1487fe0e2.
2018-05-15 16:32:01,391 INFO
org.apache.flink.runtime.jobmanager.JobManager                - Attempting
to recover job 8fdf7c630f47836fc2ce615fff96dada.
2018-05-15 16:32:01,393 INFO
org.apache.flink.runtime.jobmanager.JobManager                - Attempting
to recover job 6b1117a5b7873d0cf68e084404ba750b.
2018-05-15 16:32:01,394 INFO
org.apache.flink.runtime.jobmanager.JobManager                - Attempting
to recover job cca8fe39a515c63d407c83232e009bd8.
2018-05-15 16:32:01,395 INFO
org.apache.flink.runtime.jobmanager.JobManager                - Attempting
to recover job 5fd8f4d97c91c4c66c6748457ab8b66b.
2018-05-15 16:32:01,396 INFO
org.apache.flink.runtime.jobmanager.JobManager                - Attempting
to recover job d87590afbb9fb8bc24a4ef38729a9967.
2018-05-15 16:32:01,396 INFO
org.apache.flink.runtime.jobmanager.JobManager                - Attempting
to recover job b74b14696f637126b60a079b7a7d4710.