Posted to issues@flink.apache.org by "Zili Chen (Jira)" <ji...@apache.org> on 2020/01/03 03:23:00 UTC
[jira] [Commented] (FLINK-14091) Job can not trigger checkpoint
forever after zookeeper change leader
[ https://issues.apache.org/jira/browse/FLINK-14091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17007197#comment-17007197 ]
Zili Chen commented on FLINK-14091:
-----------------------------------
It's a known issue we also faced internally. I have a fix and will push a pull request later today.
cc [~trohrmann]
> Job can not trigger checkpoint forever after zookeeper change leader
> ---------------------------------------------------------------------
>
> Key: FLINK-14091
> URL: https://issues.apache.org/jira/browse/FLINK-14091
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Checkpointing
> Affects Versions: 1.9.0
> Reporter: Peng Wang
> Assignee: Zili Chen
> Priority: Critical
>
> When the ZooKeeper ensemble changes leader, the Curator connection state becomes SUSPENDED and the JobManager cannot trigger checkpoints. However, it still does not trigger checkpoints after the ZooKeeper connection resumes.
> We found that lastState in the class ZooKeeperCheckpointIDCounter never changes back to normal once it falls into SUSPENDED or LOST.
> /**
>  * Connection state listener. In case of {@link ConnectionState#SUSPENDED} or {@link
>  * ConnectionState#LOST} we are not guaranteed to read a current count from ZooKeeper.
>  */
> private static class SharedCountConnectionStateListener implements ConnectionStateListener {
>     private volatile ConnectionState lastState;
>
>     @Override
>     public void stateChanged(CuratorFramework client, ConnectionState newState) {
>         if (newState == ConnectionState.SUSPENDED || newState == ConnectionState.LOST) {
>             lastState = newState;
>         }
>     }
>
>     private ConnectionState getLastState() {
>         return lastState;
>     }
> }
>
> We changed the listener to reset the state once the connection leaves SUSPENDED or LOST. In our tests, this solves the problem.
>
> /**
>  * Connection state listener. In case of {@link ConnectionState#SUSPENDED} or {@link
>  * ConnectionState#LOST} we are not guaranteed to read a current count from ZooKeeper.
>  */
> private static class SharedCountConnectionStateListener implements ConnectionStateListener {
>     private volatile ConnectionState lastState;
>
>     @Override
>     public void stateChanged(CuratorFramework client, ConnectionState newState) {
>         if (newState == ConnectionState.SUSPENDED || newState == ConnectionState.LOST) {
>             lastState = newState;
>         } else {
>             /* If the connection state is neither SUSPENDED nor LOST, reset lastState. */
>             lastState = null;
>         }
>     }
>
>     private ConnectionState getLastState() {
>         return lastState;
>     }
> }
>
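For anyone who wants to reproduce the difference without a ZooKeeper cluster, here is a minimal standalone sketch of the two listener variants quoted above. It is not the Flink code: `ConnState` is a stand-in for Curator's `ConnectionState`, and `ListenerSketch`, `Listener`, `resetOnRecovery` and `canTriggerAfterReconnect` are hypothetical names made up for illustration. The guard `checkConnectionState` is only an assumed shape, inferred from the "Connection state: SUSPENDED" message in the stack trace.

```java
class ListenerSketch {

    // Stand-in for Curator's ConnectionState so the sketch runs without Curator on the classpath.
    enum ConnState { CONNECTED, SUSPENDED, RECONNECTED, LOST, READ_ONLY }

    /** Simplified listener; resetOnRecovery switches between the reported and the proposed behavior. */
    static class Listener {
        private final boolean resetOnRecovery;
        private volatile ConnState lastState;

        Listener(boolean resetOnRecovery) {
            this.resetOnRecovery = resetOnRecovery;
        }

        void stateChanged(ConnState newState) {
            if (newState == ConnState.SUSPENDED || newState == ConnState.LOST) {
                lastState = newState;
            } else if (resetOnRecovery) {
                // Proposed change: any other state clears the remembered one.
                lastState = null;
            }
        }

        ConnState getLastState() {
            return lastState;
        }
    }

    /** Assumed shape of the guard: refuse to proceed while a bad state is remembered. */
    static void checkConnectionState(Listener listener) {
        ConnState lastState = listener.getLastState();
        if (lastState != null) {
            throw new IllegalStateException("Connection state: " + lastState);
        }
    }

    /** Replays SUSPENDED then RECONNECTED and reports whether the guard lets a trigger through. */
    static boolean canTriggerAfterReconnect(boolean resetOnRecovery) {
        Listener listener = new Listener(resetOnRecovery);
        listener.stateChanged(ConnState.SUSPENDED);
        listener.stateChanged(ConnState.RECONNECTED);
        try {
            checkConnectionState(listener);
            return true;
        } catch (IllegalStateException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("without reset: " + canTriggerAfterReconnect(false)); // prints "without reset: false"
        System.out.println("with reset: " + canTriggerAfterReconnect(true));     // prints "with reset: true"
    }
}
```

Without the reset, the remembered SUSPENDED state survives the RECONNECTED event, so every later check fails, which matches the log. Note that in this sketch any state other than SUSPENDED or LOST clears the flag; whether that is also appropriate for READ_ONLY is a separate question.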
> log:
> 2019-09-16 13:38:38,020 INFO org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - Unable to read additional data from server sessionid 0x26cff6487c2000e, likely server has closed socket, closing socket connection and attempting reconnect
> 2019-09-16 13:38:38,122 INFO org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager - State change: SUSPENDED
> 2019-09-16 13:38:38,123 WARN org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Connection to ZooKeeper suspended. Can no longer retrieve the leader from ZooKeeper.
> 2019-09-16 13:38:38,126 WARN org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Connection to ZooKeeper suspended. Can no longer retrieve the leader from ZooKeeper.
> 2019-09-16 13:38:38,126 WARN org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore - ZooKeeper connection SUSPENDING. Changes to the submitted job graphs are not monitored (temporarily).
> 2019-09-16 13:38:38,128 WARN org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService - Connection to ZooKeeper suspended. The contender akka.tcp://flink@node007224:19115/user/dispatcher no longer participates in the leader election.
> 2019-09-16 13:38:38,128 WARN org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService - Connection to ZooKeeper suspended. The contender akka.tcp://flink@node007224:19115/user/resourcemanager no longer participates in the leader election.
> 2019-09-16 13:38:38,128 WARN org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Connection to ZooKeeper suspended. Can no longer retrieve the leader from ZooKeeper.
> 2019-09-16 13:38:38,128 WARN org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService - Connection to ZooKeeper suspended. The contender http://node007224:8081 no longer participates in the leader election.
> 2019-09-16 13:38:38,128 WARN org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Connection to ZooKeeper suspended. Can no longer retrieve the leader from ZooKeeper.
> 2019-09-16 13:38:38,128 WARN org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService - Connection to ZooKeeper suspended. The contender akka.tcp://flink@node007224:19115/user/jobmanager_2 no longer participates in the leader election.
> 2019-09-16 13:38:38,128 WARN org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Connection to ZooKeeper suspended. Can no longer retrieve the leader from ZooKeeper.
> 2019-09-16 13:38:38,128 WARN org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Connection to ZooKeeper suspended. Can no longer retrieve the leader from ZooKeeper.
> 2019-09-16 13:38:39,109 WARN org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - SASL configuration failed: javax.security.auth.login.LoginException: No JAAS configuration section named 'Client' was found in specified JAAS configuration file: '/tmp/jaas-4823064314619540149.conf'. Will continue connection to Zookeeper server without SASL authentication, if Zookeeper server allows it.
> 2019-09-16 13:38:39,109 INFO org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - Opening socket connection to server 192.168.7.231/192.168.7.231:2181
> 2019-09-16 13:38:39,109 ERROR org.apache.flink.shaded.curator.org.apache.curator.ConnectionState - Authentication failed
> 2019-09-16 13:38:39,110 INFO org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - Socket connection established to 192.168.7.231/192.168.7.231:2181, initiating session
> 2019-09-16 13:38:39,112 INFO org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - Unable to read additional data from server sessionid 0x26cff6487c2000e, likely server has closed socket, closing socket connection and attempting reconnect
> 2019-09-16 13:38:39,778 WARN org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - SASL configuration failed: javax.security.auth.login.LoginException: No JAAS configuration section named 'Client' was found in specified JAAS configuration file: '/tmp/jaas-4823064314619540149.conf'. Will continue connection to Zookeeper server without SASL authentication, if Zookeeper server allows it.
> 2019-09-16 13:38:39,778 INFO org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - Opening socket connection to server 192.168.7.230/192.168.7.230:2181
> 2019-09-16 13:38:39,778 ERROR org.apache.flink.shaded.curator.org.apache.curator.ConnectionState - Authentication failed
> 2019-09-16 13:38:39,778 INFO org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - Socket connection established to 192.168.7.230/192.168.7.230:2181, initiating session
> 2019-09-16 13:38:39,780 INFO org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - Session establishment complete on server 192.168.7.230/192.168.7.230:2181, sessionid = 0x26cff6487c2000e, negotiated timeout = 60000
> 2019-09-16 13:38:39,780 INFO org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager - State change: RECONNECTED
> 2019-09-16 13:38:39,780 INFO org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Connection to ZooKeeper was reconnected. Leader retrieval can be restarted.
> 2019-09-16 13:38:39,780 INFO org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Connection to ZooKeeper was reconnected. Leader retrieval can be restarted.
> 2019-09-16 13:38:39,780 INFO org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService - Connection to ZooKeeper was reconnected. Leader election can be restarted.
> 2019-09-16 13:38:39,780 INFO org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore - ZooKeeper connection RECONNECTED. Changes to the submitted job graphs are monitored again.
> 2019-09-16 13:38:39,780 INFO org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService - Connection to ZooKeeper was reconnected. Leader election can be restarted.
> 2019-09-16 13:38:39,781 INFO org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Connection to ZooKeeper was reconnected. Leader retrieval can be restarted.
> 2019-09-16 13:38:39,781 INFO org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService - Connection to ZooKeeper was reconnected. Leader election can be restarted.
> 2019-09-16 13:38:39,781 INFO org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Connection to ZooKeeper was reconnected. Leader retrieval can be restarted.
> 2019-09-16 13:38:39,781 INFO org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService - Connection to ZooKeeper was reconnected. Leader election can be restarted.
> 2019-09-16 13:38:39,781 INFO org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Connection to ZooKeeper was reconnected. Leader retrieval can be restarted.
> 2019-09-16 13:38:39,781 INFO org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Connection to ZooKeeper was reconnected. Leader retrieval can be restarted.
> 2019-09-16 13:38:43,142 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Completed checkpoint 6995 for job 21b6ef566750f5766443641254e8e1a9 (16841 bytes in 49 ms).
> 2019-09-16 13:38:43,144 ERROR org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Exception while triggering checkpoint for job 21b6ef566750f5766443641254e8e1a9.
> java.lang.IllegalStateException: Connection state: SUSPENDED
>     at org.apache.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter.checkConnectionState(ZooKeeperCheckpointIDCounter.java:159)
>     at org.apache.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter.get(ZooKeeperCheckpointIDCounter.java:133)
>     at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.triggerCheckpoint(CheckpointCoordinator.java:448)
>     at org.apache.flink.runtime.checkpoint.CheckpointCoordinator$ScheduledTrigger.run(CheckpointCoordinator.java:1323)
>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>     at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
>     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
>     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>     at java.lang.Thread.run(Thread.java:745)
--
This message was sent by Atlassian Jira
(v8.3.4#803005)