Posted to user@flink.apache.org by Marcus Clendenin <ma...@gmail.com> on 2017/09/14 14:57:43 UTC

Taskmanager unable to rejoin job manager

Hi all,

I am having an issue where one of our task managers, running in high
availability mode, times out on its connection to ZooKeeper. It then retries
the connection, which succeeds. The problem is that once the TaskManager has
reconnected to ZooKeeper, it is unable to connect to the JobManager. Does
anybody know why this is happening? This is on Flink 1.3.1 with checkpointing
using RocksDB.



Stack Trace:

2017-09-14 09:35:16,033 INFO  org.apache.zookeeper.ClientCnxn                               - Client session timed out, have not heard from server in 79531ms for sessionid 0x15e428f9953001f, closing socket connection and attempting reconnect
2017-09-14 09:35:17,170 INFO  org.apache.flink.shaded.org.apache.curator.framework.state.ConnectionStateManager  - State change: SUSPENDED
2017-09-14 09:35:17,528 WARN  org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  - Connection to ZooKeeper suspended. Can no longer retrieve the leader from ZooKeeper.
2017-09-14 09:35:17,796 WARN  org.apache.zookeeper.ClientCnxn                               - SASL configuration failed: javax.security.auth.login.LoginException: unable to find LoginModule class: org.apache.kafka.common.security.plain.PlainLoginModule Will continue connection to Zookeeper server without SASL authentication, if Zookeeper server allows it.
2017-09-14 09:35:17,796 INFO  org.apache.zookeeper.ClientCnxn                               - Opening socket connection to server zookeeper21-01/00.000.00.000:2181
2017-09-14 09:35:17,798 INFO  org.apache.zookeeper.ClientCnxn                               - Socket connection established to zookeeper21-01/00.000.00.000:2181, initiating session
2017-09-14 09:35:17,958 ERROR org.apache.flink.shaded.org.apache.curator.ConnectionState    - Authentication failed
2017-09-14 09:35:18,261 WARN  akka.remote.RemoteWatcher                                     - Detected unreachable: [akka.tcp://flink@jobmanager1:36491]
2017-09-14 09:35:18,433 INFO  org.apache.flink.shaded.org.apache.curator.framework.state.ConnectionStateManager  - State change: LOST
2017-09-14 09:35:18,433 INFO  org.apache.zookeeper.ClientCnxn                               - Unable to reconnect to ZooKeeper service, session 0x15e428f9953001f has expired, closing socket connection
2017-09-14 09:35:18,433 WARN  org.apache.flink.shaded.org.apache.curator.ConnectionState    - Session expired event received
2017-09-14 09:35:18,433 WARN  org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  - Connection to ZooKeeper lost. Can no longer retrieve the leader from ZooKeeper.
2017-09-14 09:35:18,693 INFO  org.apache.zookeeper.ZooKeeper                                - Initiating client connection, connectString=zookeeper21-01:2181,zookeeper21-02:2181,zookeeper21-03:2181,zookeeper22-01:2181,zookeeper22-02:2181 sessionTimeout=60000 watcher=org.apache.flink.shaded.org.apache.curator.ConnectionState@781f10f2
2017-09-14 09:35:18,757 INFO  org.apache.zookeeper.ClientCnxn                               - EventThread shut down
2017-09-14 09:35:19,354 WARN  org.apache.zookeeper.ClientCnxn                               - SASL configuration failed: javax.security.auth.login.LoginException: unable to find LoginModule class: org.apache.kafka.common.security.plain.PlainLoginModule Will continue connection to Zookeeper server without SASL authentication, if Zookeeper server allows it.
2017-09-14 09:35:19,354 INFO  org.apache.zookeeper.ClientCnxn                               - Opening socket connection to server zookeeper1/00.000.00.000:2181
2017-09-14 09:35:19,354 ERROR org.apache.flink.shaded.org.apache.curator.ConnectionState    - Authentication failed
2017-09-14 09:35:19,355 INFO  org.apache.zookeeper.ClientCnxn                               - Socket connection established to zookeeper1/00.000.00.000:2181, initiating session
2017-09-14 09:35:19,358 INFO  org.apache.zookeeper.ClientCnxn                               - Session establishment complete on server zookeeper1/00.000.00.000:2181, sessionid = 0x45e446247000012, negotiated timeout = 60000
2017-09-14 09:35:19,358 INFO  org.apache.flink.shaded.org.apache.curator.framework.state.ConnectionStateManager  - State change: RECONNECTED
2017-09-14 09:35:19,359 INFO  org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  - Connection to ZooKeeper was reconnected. Leader retrieval can be restarted.
2017-09-14 09:35:21,494 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - TaskManager akka://flink/user/taskmanager disconnects from JobManager akka.tcp://flink@jobmanager1:36491/user/jobmanager: JobManager is no longer reachable
2017-09-14 09:35:21,724 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Cancelling all computations and discarding all cached data.
2017-09-14 09:35:21,856 INFO  org.apache.flink.runtime.taskmanager.Task                     - Attempting to fail task externally Map (2/3) (13599aa15283f8c5af1df477cd290629).
2017-09-14 09:35:21,856 INFO  org.apache.flink.runtime.taskmanager.Task                     - Map (2/3) (13599aa15283f8c5af1df477cd290629) switched from RUNNING to FAILED.

java.lang.Exception: TaskManager akka://flink/user/taskmanager disconnects from JobManager akka.tcp://flink@jobmanager1:36491/user/jobmanager: JobManager is no longer reachable
        at org.apache.flink.runtime.taskmanager.TaskManager.handleJobManagerDisconnect(TaskManager.scala:1095)
        at org.apache.flink.runtime.taskmanager.TaskManager$$anonfun$handleMessage$1.applyOrElse(TaskManager.scala:311)
        at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
        at org.apache.flink.runtime.LeaderSessionMessageFilter$$anonfun$receive$1.applyOrElse(LeaderSessionMessageFilter.scala:49)
        at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
        at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:33)
        at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:28)
        at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
        at org.apache.flink.runtime.LogMessages$$anon$1.applyOrElse(LogMessages.scala:28)
        at akka.actor.Actor$class.aroundReceive(Actor.scala:467)
        at org.apache.flink.runtime.taskmanager.TaskManager.aroundReceive(TaskManager.scala:120)
        at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
        at akka.actor.dungeon.DeathWatch$class.receivedTerminated(DeathWatch.scala:44)
        at akka.actor.ActorCell.receivedTerminated(ActorCell.scala:369)
        at akka.actor.ActorCell.autoReceiveMessage(ActorCell.scala:501)
        at akka.actor.ActorCell.invoke(ActorCell.scala:486)
        at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
        at akka.dispatch.Mailbox.run(Mailbox.scala:220)
        at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:397)
        at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
        at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
        at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
        at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

2017-09-14 09:35:21,861 INFO  org.apache.flink.runtime.taskmanager.Task                     - Triggering cancellation of task code Map (2/3) (13599aa15283f8c5af1df477cd290629).
2017-09-14 09:35:21,861 INFO  org.apache.flink.runtime.taskmanager.Task                     - Attempting to fail task externally Timestamps/Watermarks (2/3) (9cf3d208a85e4d88fffd93d0b8152d83).
2017-09-14 09:35:21,861 INFO  org.apache.flink.runtime.taskmanager.Task                     - Timestamps/Watermarks (2/3) (9cf3d208a85e4d88fffd93d0b8152d83) switched from RUNNING to FAILED.

java.lang.Exception: TaskManager akka://flink/user/taskmanager disconnects from JobManager akka.tcp://flink@jobmanager1:36491/user/jobmanager: JobManager is no longer reachable
        at org.apache.flink.runtime.taskmanager.TaskManager.handleJobManagerDisconnect(TaskManager.scala:1095)
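
The ZooKeeper side of the trace above can be summarized with a few lines of Python. This is only a sketch and not part of the original post. The LOG string is a hand-picked excerpt of the records above with wrapping and padding collapsed, and the function names are made up for illustration. It also assumes the expired session had the same 60000 ms timeout that the log reports for the new session.

```python
import re

# Hand-picked records from the TaskManager log above, one record per line,
# with the column padding collapsed.
LOG = """\
2017-09-14 09:35:16,033 INFO org.apache.zookeeper.ClientCnxn - Client session timed out, have not heard from server in 79531ms for sessionid 0x15e428f9953001f, closing socket connection and attempting reconnect
2017-09-14 09:35:17,170 INFO org.apache.flink.shaded.org.apache.curator.framework.state.ConnectionStateManager - State change: SUSPENDED
2017-09-14 09:35:18,433 INFO org.apache.flink.shaded.org.apache.curator.framework.state.ConnectionStateManager - State change: LOST
2017-09-14 09:35:19,358 INFO org.apache.zookeeper.ClientCnxn - Session establishment complete on server zookeeper1/00.000.00.000:2181, sessionid = 0x45e446247000012, negotiated timeout = 60000
2017-09-14 09:35:19,358 INFO org.apache.flink.shaded.org.apache.curator.framework.state.ConnectionStateManager - State change: RECONNECTED
"""

TIMESTAMP = r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}"


def state_changes(log):
    """Return (timestamp, state) pairs for each Curator 'State change' record."""
    pattern = re.compile(rf"^({TIMESTAMP}) .*State change: (\w+)", re.MULTILINE)
    return pattern.findall(log)


def first_int(pattern, log):
    """Return the first captured integer for pattern, or None if it is absent."""
    match = re.search(pattern, log)
    return int(match.group(1)) if match else None


def silence_ms(log):
    """Milliseconds the client reported not hearing from the server."""
    return first_int(r"have not heard from server in (\d+)ms", log)


def negotiated_timeout_ms(log):
    """Session timeout in milliseconds negotiated with the ZooKeeper server."""
    return first_int(r"negotiated timeout = (\d+)", log)


if __name__ == "__main__":
    for timestamp, state in state_changes(LOG):
        print(timestamp, state)
    silence, timeout = silence_ms(LOG), negotiated_timeout_ms(LOG)
    print(f"silence {silence} ms, session timeout {timeout} ms, "
          f"exceeded: {silence > timeout}")
```

On this excerpt the script lists SUSPENDED, LOST and RECONNECTED in that order and reports a silence of 79531 ms against a 60000 ms timeout. That would be consistent with the old session expiring and a new one being created, rather than the old one being resumed.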

Re: Taskmanager unable to rejoin job manager

Posted by Mar_zieh <m....@gmail.com>.

Hello,

I want to run Flink on Apache Mesos with Marathon, and I have configured
ZooKeeper as well. When I run "mesos-appmaster.sh", it shows me this error:

2019-04-25 13:53:18,160 INFO  org.apache.flink.mesos.runtime.clusterframework.MesosResourceManager  - Mesos resource manager started.
2019-04-25 13:53:23,176 WARN  org.apache.flink.mesos.scheduler.ConnectionMonitor            - Unable to connect to Mesos; still trying...
2019-04-25 13:53:28,194 WARN  org.apache.flink.mesos.scheduler.ConnectionMonitor            - Unable to connect to Mesos; still trying...
2019-04-25 13:53:33,214 WARN  org.apache.flink.mesos.scheduler.ConnectionMonitor            - Unable to connect to Mesos; still trying...
2019-04-25 13:53:38,234 WARN  org.apache.flink.mesos.scheduler.ConnectionMonitor            - Unable to connect to Mesos; still trying...


Would you please tell me how to solve this error?

Many thanks.
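
One thing that may be worth ruling out first is plain network reachability of the Mesos master from the host running mesos-appmaster.sh. The Python sketch below is not from this thread. The host name is a placeholder, 5050 is the Mesos master's default port, and the real address is whatever your setup uses (for example the `mesos.master` setting in flink-conf.yaml, if it is given as host:port). A successful TCP connection does not prove that the framework can register, so treat this as a first check only.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False


if __name__ == "__main__":
    # Placeholder address: replace with your Mesos master's host and port.
    host, port = "mesos-master.example.com", 5050
    status = "reachable" if can_connect(host, port) else "NOT reachable"
    print(f"{host}:{port} is {status}")
```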




Re: Taskmanager unable to rejoin job manager

Posted by Fabian Hueske <fh...@gmail.com>.

Hi Marcus,

thanks for reaching out with your problem.
I'm not very experienced with the HA setup, but Till (in CC) might be able
to help you.

Best, Fabian
