Posted to issues@flink.apache.org by "Gary Yao (JIRA)" <ji...@apache.org> on 2019/05/16 11:50:00 UTC
[jira] [Comment Edited] (FLINK-12384) Rolling the etcd servers
causes "Connected to an old server; r-o mode will be unavailable"
[ https://issues.apache.org/jira/browse/FLINK-12384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16841209#comment-16841209 ]
Gary Yao edited comment on FLINK-12384 at 5/16/19 11:49 AM:
------------------------------------------------------------
[~haf] I checked the ZK client code and the warning is not something to be concerned about. The client is talking to a ZK server that does not support the r-o mode.
Also see:
https://github.com/apache/zookeeper/blob/e45551fc7c691332ace7bff81926855e42ac2239/zookeeper-server/src/main/java/org/apache/zookeeper/ClientCnxnSocket.java#L147-L149
http://zookeeper-user.578899.n2.nabble.com/Connected-to-an-old-server-r-o-mode-will-be-unavailable-td7578775.html
Is your cluster not starting up correctly or not recovering the jobs? If that is the case, I would like to see the complete jobmanager logs if possible.
was (Author: gjy):
[~haf] I checked the ZK client code and the warning is not something to be concerned about. The client is talking to a ZK server that does not support the r-o mode.
Also see:
https://github.com/apache/zookeeper/blob/e45551fc7c691332ace7bff81926855e42ac2239/zookeeper-server/src/main/java/org/apache/zookeeper/ClientCnxnSocket.java#L147-L149
http://zookeeper-user.578899.n2.nabble.com/Connected-to-an-old-server-r-o-mode-will-be-unavailable-td7578775.html
Is your cluster not starting up correctly or recovering the jobs? If that is the case, I would like to see the complete jobmanager logs if possible.
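To illustrate why the warning is harmless: per the linked ClientCnxnSocket code, newer ZooKeeper servers append a trailing `readOnly` boolean to the ConnectResponse; when deserializing that field fails, the client assumes an old server (zetcd behaves like one here), logs the warning, and carries on without r-o mode. The sketch below models that fallback logic only — the class and method names are illustrative, not the shaded classes Flink actually ships:

```java
import java.nio.ByteBuffer;
import java.nio.BufferUnderflowException;

// Illustrative sketch of the old-server detection in ZooKeeper's
// connect-response handling (modeled on the linked ClientCnxnSocket.java).
public class ConnectResponseSketch {

    // Servers that support r-o mode append one extra boolean byte to the
    // ConnectResponse; older servers (and zetcd) do not.
    static boolean readReadOnlyFlag(ByteBuffer response) {
        try {
            return response.get() != 0; // trailing "readOnly" boolean
        } catch (BufferUnderflowException e) {
            // Packet from an old server without the readOnly field --
            // the exact case that triggers
            // "Connected to an old server; r-o mode will be unavailable"
            System.err.println(
                "Connected to an old server; r-o mode will be unavailable");
            return false; // r-o mode simply unavailable; session proceeds
        }
    }

    public static void main(String[] args) {
        // Newer server: response carries the extra readOnly byte.
        System.out.println(readReadOnlyFlag(ByteBuffer.wrap(new byte[] {1})));

        // Old server / zetcd: no trailing byte; warning path, returns false.
        System.out.println(readReadOnlyFlag(ByteBuffer.wrap(new byte[] {})));
    }
}
```

Either way session establishment completes; the client just never gets a read-only session, which is why the warning alone does not explain a cluster failing to start or recover.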
> Rolling the etcd servers causes "Connected to an old server; r-o mode will be unavailable"
> ------------------------------------------------------------------------------------------
>
> Key: FLINK-12384
> URL: https://issues.apache.org/jira/browse/FLINK-12384
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination
> Reporter: Henrik
> Priority: Major
>
> {code:java}
> [tm] 2019-05-01 13:30:53,316 INFO org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper - Initiating client connection, connectString=analytics-zetcd:2181 sessionTimeout=60000 watcher=org.apache.flink.shaded.curator.org.apache.curator.ConnectionState@5c8eee0f
> [tm] 2019-05-01 13:30:53,384 WARN org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - SASL configuration failed: javax.security.auth.login.LoginException: No JAAS configuration section named 'Client' was found in specified JAAS configuration file: '/tmp/jaas-3674237213070587877.conf'. Will continue connection to Zookeeper server without SASL authentication, if Zookeeper server allows it.
> [tm] 2019-05-01 13:30:53,395 INFO org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - Opening socket connection to server analytics-zetcd.default.svc.cluster.local/10.108.52.97:2181
> [tm] 2019-05-01 13:30:53,395 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - Using configured hostname/address for TaskManager: 10.1.2.173.
> [tm] 2019-05-01 13:30:53,401 ERROR org.apache.flink.shaded.curator.org.apache.curator.ConnectionState - Authentication failed
> [tm] 2019-05-01 13:30:53,418 INFO org.apache.flink.runtime.rpc.akka.AkkaRpcServiceUtils - Trying to start actor system at 10.1.2.173:0
> [tm] 2019-05-01 13:30:53,420 INFO org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - Socket connection established to analytics-zetcd.default.svc.cluster.local/10.108.52.97:2181, initiating session
> [tm] 2019-05-01 13:30:53,500 WARN org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxnSocket - Connected to an old server; r-o mode will be unavailable
> [tm] 2019-05-01 13:30:53,500 INFO org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - Session establishment complete on server analytics-zetcd.default.svc.cluster.local/10.108.52.97:2181, sessionid = 0xbf06a739001d446, negotiated timeout = 60000
> [tm] 2019-05-01 13:30:53,525 INFO org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager - State change: CONNECTED{code}
> Repro:
> Start an etcd cluster, e.g. with etcd-operator, with three members. Start zetcd in front. Configure the session cluster to go against zetcd.
> Ensure the job can start successfully.
> Now, kill the etcd pods one by one, letting the quorum re-establish in between, so that the cluster is still OK.
> Now restart the job/tm pods. You'll end up in this no-man's-land.
>
> —
> Workaround: clean out the etcd cluster and remove all its data. However, this resets all time windows and state, despite having that saved in GCS, so it's a crappy workaround.
>
> –
>
> flink-conf.yaml
> {code:java}
> parallelism.default: 1
> rest.address: analytics-job
> jobmanager.rpc.address: analytics-job # = resource manager's address too
> jobmanager.heap.size: 1024m
> jobmanager.rpc.port: 6123
> jobmanager.slot.request.timeout: 30000
> resourcemanager.rpc.port: 6123
> high-availability.jobmanager.port: 6123
> blob.server.port: 6124
> queryable-state.server.ports: 6125
> taskmanager.heap.size: 1024m
> taskmanager.numberOfTaskSlots: 1
> web.log.path: /var/lib/log/flink/jobmanager.log
> rest.port: 8081
> rest.bind-address: 0.0.0.0
> web.submit.enable: false
> high-availability: zookeeper
> high-availability.storageDir: gs://example_analytics/flink/zetcd/
> high-availability.zookeeper.quorum: analytics-zetcd:2181
> high-availability.zookeeper.path.root: /flink
> high-availability.zookeeper.client.acl: open
> state.backend: rocksdb
> state.checkpoints.num-retained: 3
> state.checkpoints.dir: gs://example_analytics/flink/checkpoints
> state.savepoints.dir: gs://example_analytics/flink/savepoints
> state.backend.incremental: true
> state.backend.async: true
> fs.hdfs.hadoopconf: /opt/flink/hadoop
> log.file: /var/lib/log/flink/jobmanager.log{code}
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)