You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@flink.apache.org by Steven Wu <st...@gmail.com> on 2019/03/01 23:47:10 UTC

[1.7.1] job stuck in suspended state

We have observe that sometimes job stuck in suspended state, and no job
restart/recover were attempted once job is suspended.
* it is a high-parallelism job (like close to 2,000)
* there were a few job restarts before this
* there were high GC pause during the period
* zookeeper timeout. probably caused by high GC pause

Is it related to https://issues.apache.org/jira/browse/FLINK-11537?

I pasted some logs in the end.

Thanks,
Steven

2019-02-28 19:04:36,357 WARN
org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - SASL
configuration failed: javax.security.auth.login.LoginException: No JAAS
configuration section named 'Client' was found in speci
fied JAAS configuration file: '/tmp/jaas-6664341082794720643.conf'. Will
continue connection to Zookeeper server without SASL authentication, if
Zookeeper server allows it.
2019-02-28 19:04:36,357 INFO
org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  -
Opening socket connection to server 100.82.141.106/100.82.141.106:2181
2019-02-28 19:04:36,357 ERROR
org.apache.flink.shaded.curator.org.apache.curator.ConnectionState  -
Authentication failed
2019-02-28 19:04:36,357 INFO
org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - Socket
connection established to 100.82.141.106/100.82.141.106:2181, initiating
session
2019-02-28 19:04:36,359 INFO
org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  -
Session establishment complete on server 100.82.141.106/100.82.141.106:2181,
sessionid = 0x365ef9c4fe7f1f2, negotiated timeout = 40000
2019-02-28 19:04:36,359 INFO
org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager
- State change: RECONNECTED
2019-02-28 19:04:36,359 INFO
org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  -
Connection to ZooKeeper was reconnected. Leader election can be restarted.
2019-02-28 19:04:36,359 INFO
org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  -
Connection to ZooKeeper was reconnected. Leader election can be restarted.
2019-02-28 19:04:36,359 INFO
org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  -
Connection to ZooKeeper was reconnected. Leader election can be restarted.
2019-02-28 19:04:36,359 INFO
org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  -
Connection to ZooKeeper was reconnected. Leader election can be restarted.
2019-02-28 19:04:36,359 INFO
org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  -
Connection to ZooKeeper was reconnected. Leader retrieval can be restarted.
2019-02-28 19:04:36,359 INFO
org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  -
Connection to ZooKeeper was reconnected. Leader retrieval can be restarted.
2019-02-28 19:04:36,359 INFO
org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore  -
ZooKeeper connection RECONNECTED. Changes to the submitted job graphs are
monitored again.
2019-02-28 19:04:36,360 INFO
org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  -
Connection to ZooKeeper was reconnected. Leader retrieval can be restarte
...
2019-02-28 19:05:09,400 INFO
org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  -
Stopping ZooKeeperLeaderRetrievalService /leader/resource_manager_lock.
2019-02-28 19:05:09,400 INFO
org.apache.flink.runtime.executiongraph.ExecutionGraph        - Job
cybertron-flink (0e594c065c7f8319a12fa47e089ca9b0) switched from state
RESTARTING to SUSPENDING.
org.apache.flink.util.FlinkException: JobManager is no longer the leader.
        at
org.apache.flink.runtime.jobmaster.JobManagerRunner.revokeLeadership(JobManagerRunner.java:371)
        at
org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService.notLeader(ZooKeeperLeaderElectionService.java:247)
        at
org.apache.flink.shaded.curator.org.apache.curator.framework.recipes.leader.LeaderLatch$8.apply(LeaderLatch.java:640)
        at
org.apache.flink.shaded.curator.org.apache.curator.framework.recipes.leader.LeaderLatch$8.apply(LeaderLatch.java:636)
        at
org.apache.flink.shaded.curator.org.apache.curator.framework.listen.ListenerContainer$1.run(ListenerContainer.java:93)
        at
org.apache.flink.shaded.curator.org.apache.curator.shaded.com.google.common.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:297)
        at
org.apache.flink.shaded.curator.org.apache.curator.framework.listen.ListenerContainer.forEach(ListenerContainer.java:85)
        at
org.apache.flink.shaded.curator.org.apache.curator.framework.recipes.leader.LeaderLatch.setLeadership(LeaderLatch.java:635)
        at
org.apache.flink.shaded.curator.org.apache.curator.framework.recipes.leader.LeaderLatch.handleStateChange(LeaderLatch.java:623)
        at
org.apache.flink.shaded.curator.org.apache.curator.framework.recipes.leader.LeaderLatch.access$000(LeaderLatch.java:64)
        at
org.apache.flink.shaded.curator.org.apache.curator.framework.recipes.leader.LeaderLatch$1.stateChanged(LeaderLatch.java:82)
        at
org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager$2.apply(ConnectionStateManager.java:259)
        at
org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager$2.apply(ConnectionStateManager.java:255)
        at
org.apache.flink.shaded.curator.org.apache.curator.framework.listen.ListenerContainer$1.run(ListenerContainer.java:93)
        at
org.apache.flink.shaded.curator.org.apache.curator.shaded.com.google.common.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:297)
        at
org.apache.flink.shaded.curator.org.apache.curator.framework.listen.ListenerContainer.forEach(ListenerContainer.java:85)
        at
org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager.processEvents(ConnectionStateManager.java:253)
        at
org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager.access$000(ConnectionStateManager.java:43)
        at
org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager$1.call(ConnectionStateManager.java:111)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
2019-02-28 19:05:09,403 INFO
org.apache.flink.runtime.executiongraph.ExecutionGraph        - Job
cybertron-flink (0e594c065c7f8319a12fa47e089ca9b0) switched from state
SUSPENDING to SUSPENDED.
2019-02-28 19:05:09,403 INFO
org.apache.flink.runtime.checkpoint.CheckpointCoordinator     - Stopping
checkpoint coordinator for job 0e594c065c7f8319a12fa47e089ca9b0.
2019-02-28 19:05:09,403 INFO
org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore  -
Suspending
2019-02-28 19:05:09,448 INFO
org.apache.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter  -
Shutting down.
2019-02-28 19:05:09,448 INFO
org.apache.flink.runtime.executiongraph.ExecutionGraph        - Job
0e594c065c7f8319a12fa47e089ca9b0 has been suspended.
...
2019-02-28 19:05:09,448 INFO  org.apache.flink.runtime.jobmaster.JobMaster
                - Close ResourceManager connection
9db2027a0a32f2a44744a0d4a0f84b87: JobManager is no longer the leader..
2019-02-28 19:05:09,448 INFO
org.apache.flink.runtime.jobmaster.slotpool.SlotPool          - Suspending
SlotPool.
2019-02-28 19:05:09,448 INFO
org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor          - The rpc
endpoint org.apache.flink.runtime.jobmaster.JobMaster has not been started
yet. Discarding message
org.apache.flink.runtime.rpc.messages.RemoteFencedMessage until processing
is started.
2019-02-28 19:05:09,448 INFO
org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor          - The rpc
endpoint org.apache.flink.runtime.jobmaster.JobMaster has not been started
yet. Discarding message
org.apache.flink.runtime.rpc.messages.RemoteFencedMessage until processing
is started.
2019-02-28 19:05:09,448 INFO
org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor          - The rpc
endpoint org.apache.flink.runtime.jobmaster.JobMaster has not been started
yet. Discarding message
org.apache.flink.runtime.rpc.messages.RemoteFencedMessage until processing
is started.
2019-02-28 19:05:09,448 INFO
org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor          - The rpc
endpoint org.apache.flink.runtime.jobmaster.JobMaster has not been started
yet. Discarding message
org.apache.flink.runtime.rpc.messages.RemoteFencedMessage until processing
is started.
2019-02-28 19:05:09,448 INFO
org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor          - The rpc
endpoint org.apache.flink.runtime.jobmaster.JobMaster has not been started
yet. Discarding message
org.apache.flink.runtime.rpc.messages.RemoteFencedMessage until processing
is started.
2019-02-28 19:05:09,448 INFO
org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor          - The rpc
endpoint org.apache.flink.runtime.jobmaster.JobMaster has not been started
yet. Discarding message
org.apache.flink.runtime.rpc.messages.RemoteFencedMessage until processing
is started.
2019-02-28 19:05:09,448 INFO
org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor          - The rpc
endpoint org.apache.flink.runtime.jobmaster.JobMaster has not been started
yet. Discarding message
org.apache.flink.runtime.rpc.messages.RemoteFencedMessage until processing
is started.
2019-02-28 19:05:09,448 INFO
org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor          - The rpc
endpoint org.apache.flink.runtime.jobmaster.JobMaster has not been started
yet. Discarding message
org.apache.flink.runtime.rpc.messages.RemoteFencedMessage until processing
is started.
2019-02-28 19:05:09,448 INFO
org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor          - The rpc
endpoint org.apache.flink.runtime.jobmaster.JobMaster has not been started
yet. Discarding message
org.apache.flink.runtime.rpc.messages.RemoteFencedMessage until processing
is started.
2019-02-28 19:05:09,448 INFO
org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor          - The rpc
endpoint org.apache.flink.runtime.jobmaster.JobMaster has not been started
yet. Discarding message
org.apache.flink.runtime.rpc.messages.RemoteFencedMessage until processing
is started.
2019-02-28 19:05:09,448 INFO
org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor          - The rpc
endpoint org.apache.flink.runtime.jobmaster.JobMaster has not been started
yet. Discarding message
org.apache.flink.runtime.rpc.messages.RemoteFencedMessage until processing
is started.
2019-02-28 19:05:09,448 INFO  org.apache.flink.runtime.jobmaster.JobMaster
                - Stopping the JobMaster for job
cybertron-flink(0e594c065c7f8319a12fa47e089ca9b0).
2019-02-28 19:05:09,449 INFO
org.apache.flink.runtime.jobmaster.slotpool.SlotPool          - Stopping
SlotPool.
2019-02-28 19:05:09,449 INFO
org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  -
Stopping ZooKeeperLeaderElectionService
ZooKeeperLeaderElectionService{leaderPath='/leader/0e594c065c7f8319a12fa47e089ca9b0/job_manager_lock'}.
2019-02-28 19:05:09,452 INFO
org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore  -
Released locks of job graph 0e594c065c7f8319a12fa47e089ca9b0 from ZooKeeper.

Re: [1.7.1] job stuck in suspended state

Posted by Till Rohrmann <tr...@apache.org>.
Hi Steven,

I think I found the problem. It is caused by a JobMaster which takes a long
time to suspend the job and multiple leader changes. So what happens after
the first leadership revoking and regaining is that the Dispatcher recovers
the submitted job but waits to execute it because the JobMaster from the
previous leader session has not been stopped yet. This is done by
scheduling some futures with the current main thread executor. Now what
happens next is another cycle of leadership revoking and regaining (3rd
time that the Dispatcher receives the leadership). Due to this, we create a
new main thread executor and invalidate the previous one. As a consequence
the created future from the 2nd leadership will never be executed. This
alone is not a problem but since we serialize the leadership gaining
operations, we actually wait for this future to complete before starting
the job recovery belonging to the 3rd leadership.

Long story short: FLINK-11537 should make this problem very unlikely to
occur because the JobMaster termination is now almost instantaneous.
However it does not fully fix the underlying problem. Here is the JIRA
issue for fixing the problem:
https://issues.apache.org/jira/browse/FLINK-11843

Cheers,
Till

On Wed, Mar 6, 2019 at 1:21 PM Till Rohrmann <tr...@apache.org> wrote:

> Hi Steven,
>
> a quick update from my side after looking through the logs. The problem
> seems to be that the Dispatcher does not start recovering the jobs after
> regaining the leadership after it lost it before. I cannot yet tell why
> this is happening and I try to further debug the problem.
>
> If you manage to reproduce the problem, could you maybe run the cluster
> with DEBUG log levels and send me again the logs? If I don't manage to
> figure out how this problem happens until then, it would be super helpful.
>
> Cheers,
> Till
>
> On Mon, Mar 4, 2019 at 7:44 PM Steven Wu <st...@gmail.com> wrote:
>
>> Till,
>>
>> I will send you the complete log offline. We don't know how to reliably
>> reproduce the problem. but it did happen quite frequently, like once every
>> a couple of days. Let me see if I can cherry pick the fix/commit to 1.7
>> branch.
>>
>> Thanks,
>> Steven
>>
>>
>> On Mon, Mar 4, 2019 at 5:55 AM Till Rohrmann <tr...@apache.org>
>> wrote:
>>
>>> Hi Steven,
>>>
>>> is this the tail of the logs or are there other statements following? I
>>> think your problem could indeed be related to FLINK-11537. Is it possible
>>> to somehow reliably reproduce this problem? If yes, then you could try out
>>> the RC for Flink 1.8.0 which should be published in the next days.
>>>
>>> Cheers,
>>> Till
>>>
>>> On Sat, Mar 2, 2019 at 12:47 AM Steven Wu <st...@gmail.com> wrote:
>>>
>>>> We have observe that sometimes job stuck in suspended state, and no job
>>>> restart/recover were attempted once job is suspended.
>>>> * it is a high-parallelism job (like close to 2,000)
>>>> * there were a few job restarts before this
>>>> * there were high GC pause during the period
>>>> * zookeeper timeout. probably caused by high GC pause
>>>>
>>>> Is it related to https://issues.apache.org/jira/browse/FLINK-11537?
>>>>
>>>> I pasted some logs in the end.
>>>>
>>>> Thanks,
>>>> Steven
>>>>
>>>> 2019-02-28 19:04:36,357 WARN
>>>> org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - SASL
>>>> configuration failed: javax.security.auth.login.LoginException: No JAAS
>>>> configuration section named 'Client' was found in speci
>>>> fied JAAS configuration file: '/tmp/jaas-6664341082794720643.conf'.
>>>> Will continue connection to Zookeeper server without SASL authentication,
>>>> if Zookeeper server allows it.
>>>> 2019-02-28 19:04:36,357 INFO
>>>> org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  -
>>>> Opening socket connection to server 100.82.141.106/100.82.141.106:2181
>>>> 2019-02-28 19:04:36,357 ERROR
>>>> org.apache.flink.shaded.curator.org.apache.curator.ConnectionState  -
>>>> Authentication failed
>>>> 2019-02-28 19:04:36,357 INFO
>>>> org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - Socket
>>>> connection established to 100.82.141.106/100.82.141.106:2181,
>>>> initiating session
>>>> 2019-02-28 19:04:36,359 INFO
>>>> org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  -
>>>> Session establishment complete on server
>>>> 100.82.141.106/100.82.141.106:2181, sessionid = 0x365ef9c4fe7f1f2,
>>>> negotiated timeout = 40000
>>>> 2019-02-28 19:04:36,359 INFO
>>>> org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager
>>>> - State change: RECONNECTED
>>>> 2019-02-28 19:04:36,359 INFO
>>>> org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  -
>>>> Connection to ZooKeeper was reconnected. Leader election can be restarted.
>>>> 2019-02-28 19:04:36,359 INFO
>>>> org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  -
>>>> Connection to ZooKeeper was reconnected. Leader election can be restarted.
>>>> 2019-02-28 19:04:36,359 INFO
>>>> org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  -
>>>> Connection to ZooKeeper was reconnected. Leader election can be restarted.
>>>> 2019-02-28 19:04:36,359 INFO
>>>> org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  -
>>>> Connection to ZooKeeper was reconnected. Leader election can be restarted.
>>>> 2019-02-28 19:04:36,359 INFO
>>>> org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  -
>>>> Connection to ZooKeeper was reconnected. Leader retrieval can be restarted.
>>>> 2019-02-28 19:04:36,359 INFO
>>>> org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  -
>>>> Connection to ZooKeeper was reconnected. Leader retrieval can be restarted.
>>>> 2019-02-28 19:04:36,359 INFO
>>>> org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore  -
>>>> ZooKeeper connection RECONNECTED. Changes to the submitted job graphs are
>>>> monitored again.
>>>> 2019-02-28 19:04:36,360 INFO
>>>> org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  -
>>>> Connection to ZooKeeper was reconnected. Leader retrieval can be restarte
>>>> ...
>>>> 2019-02-28 19:05:09,400 INFO
>>>> org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  -
>>>> Stopping ZooKeeperLeaderRetrievalService /leader/resource_manager_lock.
>>>> 2019-02-28 19:05:09,400 INFO
>>>> org.apache.flink.runtime.executiongraph.ExecutionGraph        - Job
>>>> cybertron-flink (0e594c065c7f8319a12fa47e089ca9b0) switched from state
>>>> RESTARTING to SUSPENDING.
>>>> org.apache.flink.util.FlinkException: JobManager is no longer the
>>>> leader.
>>>>         at
>>>> org.apache.flink.runtime.jobmaster.JobManagerRunner.revokeLeadership(JobManagerRunner.java:371)
>>>>         at
>>>> org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService.notLeader(ZooKeeperLeaderElectionService.java:247)
>>>>         at
>>>> org.apache.flink.shaded.curator.org.apache.curator.framework.recipes.leader.LeaderLatch$8.apply(LeaderLatch.java:640)
>>>>         at
>>>> org.apache.flink.shaded.curator.org.apache.curator.framework.recipes.leader.LeaderLatch$8.apply(LeaderLatch.java:636)
>>>>         at
>>>> org.apache.flink.shaded.curator.org.apache.curator.framework.listen.ListenerContainer$1.run(ListenerContainer.java:93)
>>>>         at
>>>> org.apache.flink.shaded.curator.org.apache.curator.shaded.com.google.common.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:297)
>>>>         at
>>>> org.apache.flink.shaded.curator.org.apache.curator.framework.listen.ListenerContainer.forEach(ListenerContainer.java:85)
>>>>         at
>>>> org.apache.flink.shaded.curator.org.apache.curator.framework.recipes.leader.LeaderLatch.setLeadership(LeaderLatch.java:635)
>>>>         at
>>>> org.apache.flink.shaded.curator.org.apache.curator.framework.recipes.leader.LeaderLatch.handleStateChange(LeaderLatch.java:623)
>>>>         at
>>>> org.apache.flink.shaded.curator.org.apache.curator.framework.recipes.leader.LeaderLatch.access$000(LeaderLatch.java:64)
>>>>         at
>>>> org.apache.flink.shaded.curator.org.apache.curator.framework.recipes.leader.LeaderLatch$1.stateChanged(LeaderLatch.java:82)
>>>>         at
>>>> org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager$2.apply(ConnectionStateManager.java:259)
>>>>         at
>>>> org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager$2.apply(ConnectionStateManager.java:255)
>>>>         at
>>>> org.apache.flink.shaded.curator.org.apache.curator.framework.listen.ListenerContainer$1.run(ListenerContainer.java:93)
>>>>         at
>>>> org.apache.flink.shaded.curator.org.apache.curator.shaded.com.google.common.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:297)
>>>>         at
>>>> org.apache.flink.shaded.curator.org.apache.curator.framework.listen.ListenerContainer.forEach(ListenerContainer.java:85)
>>>>         at
>>>> org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager.processEvents(ConnectionStateManager.java:253)
>>>>         at
>>>> org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager.access$000(ConnectionStateManager.java:43)
>>>>         at
>>>> org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager$1.call(ConnectionStateManager.java:111)
>>>>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>>>>         at
>>>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>>>>         at
>>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>>>>         at java.lang.Thread.run(Thread.java:748)
>>>> 2019-02-28 19:05:09,403 INFO
>>>> org.apache.flink.runtime.executiongraph.ExecutionGraph        - Job
>>>> cybertron-flink (0e594c065c7f8319a12fa47e089ca9b0) switched from state
>>>> SUSPENDING to SUSPENDED.
>>>> 2019-02-28 19:05:09,403 INFO
>>>> org.apache.flink.runtime.checkpoint.CheckpointCoordinator     - Stopping
>>>> checkpoint coordinator for job 0e594c065c7f8319a12fa47e089ca9b0.
>>>> 2019-02-28 19:05:09,403 INFO
>>>> org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore  -
>>>> Suspending
>>>> 2019-02-28 19:05:09,448 INFO
>>>> org.apache.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter  -
>>>> Shutting down.
>>>> 2019-02-28 19:05:09,448 INFO
>>>> org.apache.flink.runtime.executiongraph.ExecutionGraph        - Job
>>>> 0e594c065c7f8319a12fa47e089ca9b0 has been suspended.
>>>> ...
>>>> 2019-02-28 19:05:09,448 INFO
>>>> org.apache.flink.runtime.jobmaster.JobMaster                  - Close
>>>> ResourceManager connection 9db2027a0a32f2a44744a0d4a0f84b87: JobManager is
>>>> no longer the leader..
>>>> 2019-02-28 19:05:09,448 INFO
>>>> org.apache.flink.runtime.jobmaster.slotpool.SlotPool          - Suspending
>>>> SlotPool.
>>>> 2019-02-28 19:05:09,448 INFO
>>>> org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor          - The rpc
>>>> endpoint org.apache.flink.runtime.jobmaster.JobMaster has not been started
>>>> yet. Discarding message
>>>> org.apache.flink.runtime.rpc.messages.RemoteFencedMessage until processing
>>>> is started.
>>>> 2019-02-28 19:05:09,448 INFO
>>>> org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor          - The rpc
>>>> endpoint org.apache.flink.runtime.jobmaster.JobMaster has not been started
>>>> yet. Discarding message
>>>> org.apache.flink.runtime.rpc.messages.RemoteFencedMessage until processing
>>>> is started.
>>>> 2019-02-28 19:05:09,448 INFO
>>>> org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor          - The rpc
>>>> endpoint org.apache.flink.runtime.jobmaster.JobMaster has not been started
>>>> yet. Discarding message
>>>> org.apache.flink.runtime.rpc.messages.RemoteFencedMessage until processing
>>>> is started.
>>>> 2019-02-28 19:05:09,448 INFO
>>>> org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor          - The rpc
>>>> endpoint org.apache.flink.runtime.jobmaster.JobMaster has not been started
>>>> yet. Discarding message
>>>> org.apache.flink.runtime.rpc.messages.RemoteFencedMessage until processing
>>>> is started.
>>>> 2019-02-28 19:05:09,448 INFO
>>>> org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor          - The rpc
>>>> endpoint org.apache.flink.runtime.jobmaster.JobMaster has not been started
>>>> yet. Discarding message
>>>> org.apache.flink.runtime.rpc.messages.RemoteFencedMessage until processing
>>>> is started.
>>>> 2019-02-28 19:05:09,448 INFO
>>>> org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor          - The rpc
>>>> endpoint org.apache.flink.runtime.jobmaster.JobMaster has not been started
>>>> yet. Discarding message
>>>> org.apache.flink.runtime.rpc.messages.RemoteFencedMessage until processing
>>>> is started.
>>>> 2019-02-28 19:05:09,448 INFO
>>>> org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor          - The rpc
>>>> endpoint org.apache.flink.runtime.jobmaster.JobMaster has not been started
>>>> yet. Discarding message
>>>> org.apache.flink.runtime.rpc.messages.RemoteFencedMessage until processing
>>>> is started.
>>>> 2019-02-28 19:05:09,448 INFO
>>>> org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor          - The rpc
>>>> endpoint org.apache.flink.runtime.jobmaster.JobMaster has not been started
>>>> yet. Discarding message
>>>> org.apache.flink.runtime.rpc.messages.RemoteFencedMessage until processing
>>>> is started.
>>>> 2019-02-28 19:05:09,448 INFO
>>>> org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor          - The rpc
>>>> endpoint org.apache.flink.runtime.jobmaster.JobMaster has not been started
>>>> yet. Discarding message
>>>> org.apache.flink.runtime.rpc.messages.RemoteFencedMessage until processing
>>>> is started.
>>>> 2019-02-28 19:05:09,448 INFO
>>>> org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor          - The rpc
>>>> endpoint org.apache.flink.runtime.jobmaster.JobMaster has not been started
>>>> yet. Discarding message
>>>> org.apache.flink.runtime.rpc.messages.RemoteFencedMessage until processing
>>>> is started.
>>>> 2019-02-28 19:05:09,448 INFO
>>>> org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor          - The rpc
>>>> endpoint org.apache.flink.runtime.jobmaster.JobMaster has not been started
>>>> yet. Discarding message
>>>> org.apache.flink.runtime.rpc.messages.RemoteFencedMessage until processing
>>>> is started.
>>>> 2019-02-28 19:05:09,448 INFO
>>>> org.apache.flink.runtime.jobmaster.JobMaster                  - Stopping
>>>> the JobMaster for job cybertron-flink(0e594c065c7f8319a12fa47e089ca9b0).
>>>> 2019-02-28 19:05:09,449 INFO
>>>> org.apache.flink.runtime.jobmaster.slotpool.SlotPool          - Stopping
>>>> SlotPool.
>>>> 2019-02-28 19:05:09,449 INFO
>>>> org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  -
>>>> Stopping ZooKeeperLeaderElectionService
>>>> ZooKeeperLeaderElectionService{leaderPath='/leader/0e594c065c7f8319a12fa47e089ca9b0/job_manager_lock'}.
>>>> 2019-02-28 19:05:09,452 INFO
>>>> org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore  -
>>>> Released locks of job graph 0e594c065c7f8319a12fa47e089ca9b0 from ZooKeeper.
>>>>
>>>

Re: [1.7.1] job stuck in suspended state

Posted by Till Rohrmann <tr...@apache.org>.
Hi Steven,

a quick update from my side after looking through the logs. The problem
seems to be that the Dispatcher does not start recovering the jobs after
regaining the leadership after it lost it before. I cannot yet tell why
this is happening and I try to further debug the problem.

If you manage to reproduce the problem, could you maybe run the cluster
with DEBUG log levels and send me again the logs? If I don't manage to
figure out how this problem happens until then, it would be super helpful.

Cheers,
Till

On Mon, Mar 4, 2019 at 7:44 PM Steven Wu <st...@gmail.com> wrote:

> Till,
>
> I will send you the complete log offline. We don't know how to reliably
> reproduce the problem. but it did happen quite frequently, like once every
> a couple of days. Let me see if I can cherry pick the fix/commit to 1.7
> branch.
>
> Thanks,
> Steven
>
>
> On Mon, Mar 4, 2019 at 5:55 AM Till Rohrmann <tr...@apache.org> wrote:
>
>> Hi Steven,
>>
>> is this the tail of the logs or are there other statements following? I
>> think your problem could indeed be related to FLINK-11537. Is it possible
>> to somehow reliably reproduce this problem? If yes, then you could try out
>> the RC for Flink 1.8.0 which should be published in the next days.
>>
>> Cheers,
>> Till
>>
>> On Sat, Mar 2, 2019 at 12:47 AM Steven Wu <st...@gmail.com> wrote:
>>
>>> We have observe that sometimes job stuck in suspended state, and no job
>>> restart/recover were attempted once job is suspended.
>>> * it is a high-parallelism job (like close to 2,000)
>>> * there were a few job restarts before this
>>> * there were high GC pause during the period
>>> * zookeeper timeout. probably caused by high GC pause
>>>
>>> Is it related to https://issues.apache.org/jira/browse/FLINK-11537?
>>>
>>> I pasted some logs in the end.
>>>
>>> Thanks,
>>> Steven
>>>
>>> 2019-02-28 19:04:36,357 WARN
>>> org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - SASL
>>> configuration failed: javax.security.auth.login.LoginException: No JAAS
>>> configuration section named 'Client' was found in speci
>>> fied JAAS configuration file: '/tmp/jaas-6664341082794720643.conf'. Will
>>> continue connection to Zookeeper server without SASL authentication, if
>>> Zookeeper server allows it.
>>> 2019-02-28 19:04:36,357 INFO
>>> org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  -
>>> Opening socket connection to server 100.82.141.106/100.82.141.106:2181
>>> 2019-02-28 19:04:36,357 ERROR
>>> org.apache.flink.shaded.curator.org.apache.curator.ConnectionState  -
>>> Authentication failed
>>> 2019-02-28 19:04:36,357 INFO
>>> org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - Socket
>>> connection established to 100.82.141.106/100.82.141.106:2181,
>>> initiating session
>>> 2019-02-28 19:04:36,359 INFO
>>> org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  -
>>> Session establishment complete on server
>>> 100.82.141.106/100.82.141.106:2181, sessionid = 0x365ef9c4fe7f1f2,
>>> negotiated timeout = 40000
>>> 2019-02-28 19:04:36,359 INFO
>>> org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager
>>> - State change: RECONNECTED
>>> 2019-02-28 19:04:36,359 INFO
>>> org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  -
>>> Connection to ZooKeeper was reconnected. Leader election can be restarted.
>>> 2019-02-28 19:04:36,359 INFO
>>> org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  -
>>> Connection to ZooKeeper was reconnected. Leader election can be restarted.
>>> 2019-02-28 19:04:36,359 INFO
>>> org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  -
>>> Connection to ZooKeeper was reconnected. Leader election can be restarted.
>>> 2019-02-28 19:04:36,359 INFO
>>> org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  -
>>> Connection to ZooKeeper was reconnected. Leader election can be restarted.
>>> 2019-02-28 19:04:36,359 INFO
>>> org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  -
>>> Connection to ZooKeeper was reconnected. Leader retrieval can be restarted.
>>> 2019-02-28 19:04:36,359 INFO
>>> org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  -
>>> Connection to ZooKeeper was reconnected. Leader retrieval can be restarted.
>>> 2019-02-28 19:04:36,359 INFO
>>> org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore  -
>>> ZooKeeper connection RECONNECTED. Changes to the submitted job graphs are
>>> monitored again.
>>> 2019-02-28 19:04:36,360 INFO
>>> org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  -
>>> Connection to ZooKeeper was reconnected. Leader retrieval can be restarte
>>> ...
>>> 2019-02-28 19:05:09,400 INFO
>>> org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  -
>>> Stopping ZooKeeperLeaderRetrievalService /leader/resource_manager_lock.
>>> 2019-02-28 19:05:09,400 INFO
>>> org.apache.flink.runtime.executiongraph.ExecutionGraph        - Job
>>> cybertron-flink (0e594c065c7f8319a12fa47e089ca9b0) switched from state
>>> RESTARTING to SUSPENDING.
>>> org.apache.flink.util.FlinkException: JobManager is no longer the leader.
>>>         at
>>> org.apache.flink.runtime.jobmaster.JobManagerRunner.revokeLeadership(JobManagerRunner.java:371)
>>>         at
>>> org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService.notLeader(ZooKeeperLeaderElectionService.java:247)
>>>         at
>>> org.apache.flink.shaded.curator.org.apache.curator.framework.recipes.leader.LeaderLatch$8.apply(LeaderLatch.java:640)
>>>         at
>>> org.apache.flink.shaded.curator.org.apache.curator.framework.recipes.leader.LeaderLatch$8.apply(LeaderLatch.java:636)
>>>         at
>>> org.apache.flink.shaded.curator.org.apache.curator.framework.listen.ListenerContainer$1.run(ListenerContainer.java:93)
>>>         at
>>> org.apache.flink.shaded.curator.org.apache.curator.shaded.com.google.common.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:297)
>>>         at
>>> org.apache.flink.shaded.curator.org.apache.curator.framework.listen.ListenerContainer.forEach(ListenerContainer.java:85)
>>>         at
>>> org.apache.flink.shaded.curator.org.apache.curator.framework.recipes.leader.LeaderLatch.setLeadership(LeaderLatch.java:635)
>>>         at
>>> org.apache.flink.shaded.curator.org.apache.curator.framework.recipes.leader.LeaderLatch.handleStateChange(LeaderLatch.java:623)
>>>         at
>>> org.apache.flink.shaded.curator.org.apache.curator.framework.recipes.leader.LeaderLatch.access$000(LeaderLatch.java:64)
>>>         at
>>> org.apache.flink.shaded.curator.org.apache.curator.framework.recipes.leader.LeaderLatch$1.stateChanged(LeaderLatch.java:82)
>>>         at
>>> org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager$2.apply(ConnectionStateManager.java:259)
>>>         at
>>> org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager$2.apply(ConnectionStateManager.java:255)
>>>         at
>>> org.apache.flink.shaded.curator.org.apache.curator.framework.listen.ListenerContainer$1.run(ListenerContainer.java:93)
>>>         at
>>> org.apache.flink.shaded.curator.org.apache.curator.shaded.com.google.common.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:297)
>>>         at
>>> org.apache.flink.shaded.curator.org.apache.curator.framework.listen.ListenerContainer.forEach(ListenerContainer.java:85)
>>>         at
>>> org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager.processEvents(ConnectionStateManager.java:253)
>>>         at
>>> org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager.access$000(ConnectionStateManager.java:43)
>>>         at
>>> org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager$1.call(ConnectionStateManager.java:111)
>>>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>>>         at
>>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>>>         at
>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>>>         at java.lang.Thread.run(Thread.java:748)
>>> 2019-02-28 19:05:09,403 INFO
>>> org.apache.flink.runtime.executiongraph.ExecutionGraph        - Job
>>> cybertron-flink (0e594c065c7f8319a12fa47e089ca9b0) switched from state
>>> SUSPENDING to SUSPENDED.
>>> 2019-02-28 19:05:09,403 INFO
>>> org.apache.flink.runtime.checkpoint.CheckpointCoordinator     - Stopping
>>> checkpoint coordinator for job 0e594c065c7f8319a12fa47e089ca9b0.
>>> 2019-02-28 19:05:09,403 INFO
>>> org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore  -
>>> Suspending
>>> 2019-02-28 19:05:09,448 INFO
>>> org.apache.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter  -
>>> Shutting down.
>>> 2019-02-28 19:05:09,448 INFO
>>> org.apache.flink.runtime.executiongraph.ExecutionGraph        - Job
>>> 0e594c065c7f8319a12fa47e089ca9b0 has been suspended.
>>> ...
>>> 2019-02-28 19:05:09,448 INFO
>>> org.apache.flink.runtime.jobmaster.JobMaster                  - Close
>>> ResourceManager connection 9db2027a0a32f2a44744a0d4a0f84b87: JobManager is
>>> no longer the leader..
>>> 2019-02-28 19:05:09,448 INFO
>>> org.apache.flink.runtime.jobmaster.slotpool.SlotPool          - Suspending
>>> SlotPool.
>>> 2019-02-28 19:05:09,448 INFO
>>> org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor          - The rpc
>>> endpoint org.apache.flink.runtime.jobmaster.JobMaster has not been started
>>> yet. Discarding message
>>> org.apache.flink.runtime.rpc.messages.RemoteFencedMessage until processing
>>> is started.
>>> 2019-02-28 19:05:09,448 INFO
>>> org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor          - The rpc
>>> endpoint org.apache.flink.runtime.jobmaster.JobMaster has not been started
>>> yet. Discarding message
>>> org.apache.flink.runtime.rpc.messages.RemoteFencedMessage until processing
>>> is started.
>>> 2019-02-28 19:05:09,448 INFO
>>> org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor          - The rpc
>>> endpoint org.apache.flink.runtime.jobmaster.JobMaster has not been started
>>> yet. Discarding message
>>> org.apache.flink.runtime.rpc.messages.RemoteFencedMessage until processing
>>> is started.
>>> 2019-02-28 19:05:09,448 INFO
>>> org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor          - The rpc
>>> endpoint org.apache.flink.runtime.jobmaster.JobMaster has not been started
>>> yet. Discarding message
>>> org.apache.flink.runtime.rpc.messages.RemoteFencedMessage until processing
>>> is started.
>>> 2019-02-28 19:05:09,448 INFO
>>> org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor          - The rpc
>>> endpoint org.apache.flink.runtime.jobmaster.JobMaster has not been started
>>> yet. Discarding message
>>> org.apache.flink.runtime.rpc.messages.RemoteFencedMessage until processing
>>> is started.
>>> 2019-02-28 19:05:09,448 INFO
>>> org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor          - The rpc
>>> endpoint org.apache.flink.runtime.jobmaster.JobMaster has not been started
>>> yet. Discarding message
>>> org.apache.flink.runtime.rpc.messages.RemoteFencedMessage until processing
>>> is started.
>>> 2019-02-28 19:05:09,448 INFO
>>> org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor          - The rpc
>>> endpoint org.apache.flink.runtime.jobmaster.JobMaster has not been started
>>> yet. Discarding message
>>> org.apache.flink.runtime.rpc.messages.RemoteFencedMessage until processing
>>> is started.
>>> 2019-02-28 19:05:09,448 INFO
>>> org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor          - The rpc
>>> endpoint org.apache.flink.runtime.jobmaster.JobMaster has not been started
>>> yet. Discarding message
>>> org.apache.flink.runtime.rpc.messages.RemoteFencedMessage until processing
>>> is started.
>>> 2019-02-28 19:05:09,448 INFO
>>> org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor          - The rpc
>>> endpoint org.apache.flink.runtime.jobmaster.JobMaster has not been started
>>> yet. Discarding message
>>> org.apache.flink.runtime.rpc.messages.RemoteFencedMessage until processing
>>> is started.
>>> 2019-02-28 19:05:09,448 INFO
>>> org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor          - The rpc
>>> endpoint org.apache.flink.runtime.jobmaster.JobMaster has not been started
>>> yet. Discarding message
>>> org.apache.flink.runtime.rpc.messages.RemoteFencedMessage until processing
>>> is started.
>>> 2019-02-28 19:05:09,448 INFO
>>> org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor          - The rpc
>>> endpoint org.apache.flink.runtime.jobmaster.JobMaster has not been started
>>> yet. Discarding message
>>> org.apache.flink.runtime.rpc.messages.RemoteFencedMessage until processing
>>> is started.
>>> 2019-02-28 19:05:09,448 INFO
>>> org.apache.flink.runtime.jobmaster.JobMaster                  - Stopping
>>> the JobMaster for job cybertron-flink(0e594c065c7f8319a12fa47e089ca9b0).
>>> 2019-02-28 19:05:09,449 INFO
>>> org.apache.flink.runtime.jobmaster.slotpool.SlotPool          - Stopping
>>> SlotPool.
>>> 2019-02-28 19:05:09,449 INFO
>>> org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  -
>>> Stopping ZooKeeperLeaderElectionService
>>> ZooKeeperLeaderElectionService{leaderPath='/leader/0e594c065c7f8319a12fa47e089ca9b0/job_manager_lock'}.
>>> 2019-02-28 19:05:09,452 INFO
>>> org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore  -
>>> Released locks of job graph 0e594c065c7f8319a12fa47e089ca9b0 from ZooKeeper.
>>>
>>

Re: [1.7.1] job stuck in suspended state

Posted by Steven Wu <st...@gmail.com>.
Till,

I will send you the complete log offline. We don't know how to reliably
reproduce the problem. but it did happen quite frequently, like once every
a couple of days. Let me see if I can cherry pick the fix/commit to 1.7
branch.

Thanks,
Steven


On Mon, Mar 4, 2019 at 5:55 AM Till Rohrmann <tr...@apache.org> wrote:

> Hi Steven,
>
> is this the tail of the logs or are there other statements following? I
> think your problem could indeed be related to FLINK-11537. Is it possible
> to somehow reliably reproduce this problem? If yes, then you could try out
> the RC for Flink 1.8.0 which should be published in the next days.
>
> Cheers,
> Till
>
> On Sat, Mar 2, 2019 at 12:47 AM Steven Wu <st...@gmail.com> wrote:
>
>> We have observe that sometimes job stuck in suspended state, and no job
>> restart/recover were attempted once job is suspended.
>> * it is a high-parallelism job (like close to 2,000)
>> * there were a few job restarts before this
>> * there were high GC pause during the period
>> * zookeeper timeout. probably caused by high GC pause
>>
>> Is it related to https://issues.apache.org/jira/browse/FLINK-11537?
>>
>> I pasted some logs in the end.
>>
>> Thanks,
>> Steven
>>
>> 2019-02-28 19:04:36,357 WARN
>> org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - SASL
>> configuration failed: javax.security.auth.login.LoginException: No JAAS
>> configuration section named 'Client' was found in speci
>> fied JAAS configuration file: '/tmp/jaas-6664341082794720643.conf'. Will
>> continue connection to Zookeeper server without SASL authentication, if
>> Zookeeper server allows it.
>> 2019-02-28 19:04:36,357 INFO
>> org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  -
>> Opening socket connection to server 100.82.141.106/100.82.141.106:2181
>> 2019-02-28 19:04:36,357 ERROR
>> org.apache.flink.shaded.curator.org.apache.curator.ConnectionState  -
>> Authentication failed
>> 2019-02-28 19:04:36,357 INFO
>> org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - Socket
>> connection established to 100.82.141.106/100.82.141.106:2181, initiating
>> session
>> 2019-02-28 19:04:36,359 INFO
>> org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  -
>> Session establishment complete on server
>> 100.82.141.106/100.82.141.106:2181, sessionid = 0x365ef9c4fe7f1f2,
>> negotiated timeout = 40000
>> 2019-02-28 19:04:36,359 INFO
>> org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager
>> - State change: RECONNECTED
>> 2019-02-28 19:04:36,359 INFO
>> org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  -
>> Connection to ZooKeeper was reconnected. Leader election can be restarted.
>> 2019-02-28 19:04:36,359 INFO
>> org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  -
>> Connection to ZooKeeper was reconnected. Leader election can be restarted.
>> 2019-02-28 19:04:36,359 INFO
>> org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  -
>> Connection to ZooKeeper was reconnected. Leader election can be restarted.
>> 2019-02-28 19:04:36,359 INFO
>> org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  -
>> Connection to ZooKeeper was reconnected. Leader election can be restarted.
>> 2019-02-28 19:04:36,359 INFO
>> org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  -
>> Connection to ZooKeeper was reconnected. Leader retrieval can be restarted.
>> 2019-02-28 19:04:36,359 INFO
>> org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  -
>> Connection to ZooKeeper was reconnected. Leader retrieval can be restarted.
>> 2019-02-28 19:04:36,359 INFO
>> org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore  -
>> ZooKeeper connection RECONNECTED. Changes to the submitted job graphs are
>> monitored again.
>> 2019-02-28 19:04:36,360 INFO
>> org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  -
>> Connection to ZooKeeper was reconnected. Leader retrieval can be restarte
>> ...
>> 2019-02-28 19:05:09,400 INFO
>> org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  -
>> Stopping ZooKeeperLeaderRetrievalService /leader/resource_manager_lock.
>> 2019-02-28 19:05:09,400 INFO
>> org.apache.flink.runtime.executiongraph.ExecutionGraph        - Job
>> cybertron-flink (0e594c065c7f8319a12fa47e089ca9b0) switched from state
>> RESTARTING to SUSPENDING.
>> org.apache.flink.util.FlinkException: JobManager is no longer the leader.
>>         at
>> org.apache.flink.runtime.jobmaster.JobManagerRunner.revokeLeadership(JobManagerRunner.java:371)
>>         at
>> org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService.notLeader(ZooKeeperLeaderElectionService.java:247)
>>         at
>> org.apache.flink.shaded.curator.org.apache.curator.framework.recipes.leader.LeaderLatch$8.apply(LeaderLatch.java:640)
>>         at
>> org.apache.flink.shaded.curator.org.apache.curator.framework.recipes.leader.LeaderLatch$8.apply(LeaderLatch.java:636)
>>         at
>> org.apache.flink.shaded.curator.org.apache.curator.framework.listen.ListenerContainer$1.run(ListenerContainer.java:93)
>>         at
>> org.apache.flink.shaded.curator.org.apache.curator.shaded.com.google.common.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:297)
>>         at
>> org.apache.flink.shaded.curator.org.apache.curator.framework.listen.ListenerContainer.forEach(ListenerContainer.java:85)
>>         at
>> org.apache.flink.shaded.curator.org.apache.curator.framework.recipes.leader.LeaderLatch.setLeadership(LeaderLatch.java:635)
>>         at
>> org.apache.flink.shaded.curator.org.apache.curator.framework.recipes.leader.LeaderLatch.handleStateChange(LeaderLatch.java:623)
>>         at
>> org.apache.flink.shaded.curator.org.apache.curator.framework.recipes.leader.LeaderLatch.access$000(LeaderLatch.java:64)
>>         at
>> org.apache.flink.shaded.curator.org.apache.curator.framework.recipes.leader.LeaderLatch$1.stateChanged(LeaderLatch.java:82)
>>         at
>> org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager$2.apply(ConnectionStateManager.java:259)
>>         at
>> org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager$2.apply(ConnectionStateManager.java:255)
>>         at
>> org.apache.flink.shaded.curator.org.apache.curator.framework.listen.ListenerContainer$1.run(ListenerContainer.java:93)
>>         at
>> org.apache.flink.shaded.curator.org.apache.curator.shaded.com.google.common.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:297)
>>         at
>> org.apache.flink.shaded.curator.org.apache.curator.framework.listen.ListenerContainer.forEach(ListenerContainer.java:85)
>>         at
>> org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager.processEvents(ConnectionStateManager.java:253)
>>         at
>> org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager.access$000(ConnectionStateManager.java:43)
>>         at
>> org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager$1.call(ConnectionStateManager.java:111)
>>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>>         at
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>>         at
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>>         at java.lang.Thread.run(Thread.java:748)
>> 2019-02-28 19:05:09,403 INFO
>> org.apache.flink.runtime.executiongraph.ExecutionGraph        - Job
>> cybertron-flink (0e594c065c7f8319a12fa47e089ca9b0) switched from state
>> SUSPENDING to SUSPENDED.
>> 2019-02-28 19:05:09,403 INFO
>> org.apache.flink.runtime.checkpoint.CheckpointCoordinator     - Stopping
>> checkpoint coordinator for job 0e594c065c7f8319a12fa47e089ca9b0.
>> 2019-02-28 19:05:09,403 INFO
>> org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore  -
>> Suspending
>> 2019-02-28 19:05:09,448 INFO
>> org.apache.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter  -
>> Shutting down.
>> 2019-02-28 19:05:09,448 INFO
>> org.apache.flink.runtime.executiongraph.ExecutionGraph        - Job
>> 0e594c065c7f8319a12fa47e089ca9b0 has been suspended.
>> ...
>> 2019-02-28 19:05:09,448 INFO
>> org.apache.flink.runtime.jobmaster.JobMaster                  - Close
>> ResourceManager connection 9db2027a0a32f2a44744a0d4a0f84b87: JobManager is
>> no longer the leader..
>> 2019-02-28 19:05:09,448 INFO
>> org.apache.flink.runtime.jobmaster.slotpool.SlotPool          - Suspending
>> SlotPool.
>> 2019-02-28 19:05:09,448 INFO
>> org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor          - The rpc
>> endpoint org.apache.flink.runtime.jobmaster.JobMaster has not been started
>> yet. Discarding message
>> org.apache.flink.runtime.rpc.messages.RemoteFencedMessage until processing
>> is started.
>> 2019-02-28 19:05:09,448 INFO
>> org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor          - The rpc
>> endpoint org.apache.flink.runtime.jobmaster.JobMaster has not been started
>> yet. Discarding message
>> org.apache.flink.runtime.rpc.messages.RemoteFencedMessage until processing
>> is started.
>> 2019-02-28 19:05:09,448 INFO
>> org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor          - The rpc
>> endpoint org.apache.flink.runtime.jobmaster.JobMaster has not been started
>> yet. Discarding message
>> org.apache.flink.runtime.rpc.messages.RemoteFencedMessage until processing
>> is started.
>> 2019-02-28 19:05:09,448 INFO
>> org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor          - The rpc
>> endpoint org.apache.flink.runtime.jobmaster.JobMaster has not been started
>> yet. Discarding message
>> org.apache.flink.runtime.rpc.messages.RemoteFencedMessage until processing
>> is started.
>> 2019-02-28 19:05:09,448 INFO
>> org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor          - The rpc
>> endpoint org.apache.flink.runtime.jobmaster.JobMaster has not been started
>> yet. Discarding message
>> org.apache.flink.runtime.rpc.messages.RemoteFencedMessage until processing
>> is started.
>> 2019-02-28 19:05:09,448 INFO
>> org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor          - The rpc
>> endpoint org.apache.flink.runtime.jobmaster.JobMaster has not been started
>> yet. Discarding message
>> org.apache.flink.runtime.rpc.messages.RemoteFencedMessage until processing
>> is started.
>> 2019-02-28 19:05:09,448 INFO
>> org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor          - The rpc
>> endpoint org.apache.flink.runtime.jobmaster.JobMaster has not been started
>> yet. Discarding message
>> org.apache.flink.runtime.rpc.messages.RemoteFencedMessage until processing
>> is started.
>> 2019-02-28 19:05:09,448 INFO
>> org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor          - The rpc
>> endpoint org.apache.flink.runtime.jobmaster.JobMaster has not been started
>> yet. Discarding message
>> org.apache.flink.runtime.rpc.messages.RemoteFencedMessage until processing
>> is started.
>> 2019-02-28 19:05:09,448 INFO
>> org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor          - The rpc
>> endpoint org.apache.flink.runtime.jobmaster.JobMaster has not been started
>> yet. Discarding message
>> org.apache.flink.runtime.rpc.messages.RemoteFencedMessage until processing
>> is started.
>> 2019-02-28 19:05:09,448 INFO
>> org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor          - The rpc
>> endpoint org.apache.flink.runtime.jobmaster.JobMaster has not been started
>> yet. Discarding message
>> org.apache.flink.runtime.rpc.messages.RemoteFencedMessage until processing
>> is started.
>> 2019-02-28 19:05:09,448 INFO
>> org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor          - The rpc
>> endpoint org.apache.flink.runtime.jobmaster.JobMaster has not been started
>> yet. Discarding message
>> org.apache.flink.runtime.rpc.messages.RemoteFencedMessage until processing
>> is started.
>> 2019-02-28 19:05:09,448 INFO
>> org.apache.flink.runtime.jobmaster.JobMaster                  - Stopping
>> the JobMaster for job cybertron-flink(0e594c065c7f8319a12fa47e089ca9b0).
>> 2019-02-28 19:05:09,449 INFO
>> org.apache.flink.runtime.jobmaster.slotpool.SlotPool          - Stopping
>> SlotPool.
>> 2019-02-28 19:05:09,449 INFO
>> org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  -
>> Stopping ZooKeeperLeaderElectionService
>> ZooKeeperLeaderElectionService{leaderPath='/leader/0e594c065c7f8319a12fa47e089ca9b0/job_manager_lock'}.
>> 2019-02-28 19:05:09,452 INFO
>> org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore  -
>> Released locks of job graph 0e594c065c7f8319a12fa47e089ca9b0 from ZooKeeper.
>>
>

Re: [1.7.1] job stuck in suspended state

Posted by Till Rohrmann <tr...@apache.org>.
Hi Steven,

is this the tail of the logs or are there other statements following? I
think your problem could indeed be related to FLINK-11537. Is it possible
to somehow reliably reproduce this problem? If yes, then you could try out
the RC for Flink 1.8.0 which should be published in the next days.

Cheers,
Till

On Sat, Mar 2, 2019 at 12:47 AM Steven Wu <st...@gmail.com> wrote:

> We have observe that sometimes job stuck in suspended state, and no job
> restart/recover were attempted once job is suspended.
> * it is a high-parallelism job (like close to 2,000)
> * there were a few job restarts before this
> * there were high GC pause during the period
> * zookeeper timeout. probably caused by high GC pause
>
> Is it related to https://issues.apache.org/jira/browse/FLINK-11537?
>
> I pasted some logs in the end.
>
> Thanks,
> Steven
>
> 2019-02-28 19:04:36,357 WARN
> org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - SASL
> configuration failed: javax.security.auth.login.LoginException: No JAAS
> configuration section named 'Client' was found in speci
> fied JAAS configuration file: '/tmp/jaas-6664341082794720643.conf'. Will
> continue connection to Zookeeper server without SASL authentication, if
> Zookeeper server allows it.
> 2019-02-28 19:04:36,357 INFO
> org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  -
> Opening socket connection to server 100.82.141.106/100.82.141.106:2181
> 2019-02-28 19:04:36,357 ERROR
> org.apache.flink.shaded.curator.org.apache.curator.ConnectionState  -
> Authentication failed
> 2019-02-28 19:04:36,357 INFO
> org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - Socket
> connection established to 100.82.141.106/100.82.141.106:2181, initiating
> session
> 2019-02-28 19:04:36,359 INFO
> org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  -
> Session establishment complete on server
> 100.82.141.106/100.82.141.106:2181, sessionid = 0x365ef9c4fe7f1f2,
> negotiated timeout = 40000
> 2019-02-28 19:04:36,359 INFO
> org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager
> - State change: RECONNECTED
> 2019-02-28 19:04:36,359 INFO
> org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  -
> Connection to ZooKeeper was reconnected. Leader election can be restarted.
> 2019-02-28 19:04:36,359 INFO
> org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  -
> Connection to ZooKeeper was reconnected. Leader election can be restarted.
> 2019-02-28 19:04:36,359 INFO
> org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  -
> Connection to ZooKeeper was reconnected. Leader election can be restarted.
> 2019-02-28 19:04:36,359 INFO
> org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  -
> Connection to ZooKeeper was reconnected. Leader election can be restarted.
> 2019-02-28 19:04:36,359 INFO
> org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  -
> Connection to ZooKeeper was reconnected. Leader retrieval can be restarted.
> 2019-02-28 19:04:36,359 INFO
> org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  -
> Connection to ZooKeeper was reconnected. Leader retrieval can be restarted.
> 2019-02-28 19:04:36,359 INFO
> org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore  -
> ZooKeeper connection RECONNECTED. Changes to the submitted job graphs are
> monitored again.
> 2019-02-28 19:04:36,360 INFO
> org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  -
> Connection to ZooKeeper was reconnected. Leader retrieval can be restarte
> ...
> 2019-02-28 19:05:09,400 INFO
> org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  -
> Stopping ZooKeeperLeaderRetrievalService /leader/resource_manager_lock.
> 2019-02-28 19:05:09,400 INFO
> org.apache.flink.runtime.executiongraph.ExecutionGraph        - Job
> cybertron-flink (0e594c065c7f8319a12fa47e089ca9b0) switched from state
> RESTARTING to SUSPENDING.
> org.apache.flink.util.FlinkException: JobManager is no longer the leader.
>         at
> org.apache.flink.runtime.jobmaster.JobManagerRunner.revokeLeadership(JobManagerRunner.java:371)
>         at
> org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService.notLeader(ZooKeeperLeaderElectionService.java:247)
>         at
> org.apache.flink.shaded.curator.org.apache.curator.framework.recipes.leader.LeaderLatch$8.apply(LeaderLatch.java:640)
>         at
> org.apache.flink.shaded.curator.org.apache.curator.framework.recipes.leader.LeaderLatch$8.apply(LeaderLatch.java:636)
>         at
> org.apache.flink.shaded.curator.org.apache.curator.framework.listen.ListenerContainer$1.run(ListenerContainer.java:93)
>         at
> org.apache.flink.shaded.curator.org.apache.curator.shaded.com.google.common.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:297)
>         at
> org.apache.flink.shaded.curator.org.apache.curator.framework.listen.ListenerContainer.forEach(ListenerContainer.java:85)
>         at
> org.apache.flink.shaded.curator.org.apache.curator.framework.recipes.leader.LeaderLatch.setLeadership(LeaderLatch.java:635)
>         at
> org.apache.flink.shaded.curator.org.apache.curator.framework.recipes.leader.LeaderLatch.handleStateChange(LeaderLatch.java:623)
>         at
> org.apache.flink.shaded.curator.org.apache.curator.framework.recipes.leader.LeaderLatch.access$000(LeaderLatch.java:64)
>         at
> org.apache.flink.shaded.curator.org.apache.curator.framework.recipes.leader.LeaderLatch$1.stateChanged(LeaderLatch.java:82)
>         at
> org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager$2.apply(ConnectionStateManager.java:259)
>         at
> org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager$2.apply(ConnectionStateManager.java:255)
>         at
> org.apache.flink.shaded.curator.org.apache.curator.framework.listen.ListenerContainer$1.run(ListenerContainer.java:93)
>         at
> org.apache.flink.shaded.curator.org.apache.curator.shaded.com.google.common.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:297)
>         at
> org.apache.flink.shaded.curator.org.apache.curator.framework.listen.ListenerContainer.forEach(ListenerContainer.java:85)
>         at
> org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager.processEvents(ConnectionStateManager.java:253)
>         at
> org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager.access$000(ConnectionStateManager.java:43)
>         at
> org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager$1.call(ConnectionStateManager.java:111)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>         at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>         at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>         at java.lang.Thread.run(Thread.java:748)
> 2019-02-28 19:05:09,403 INFO
> org.apache.flink.runtime.executiongraph.ExecutionGraph        - Job
> cybertron-flink (0e594c065c7f8319a12fa47e089ca9b0) switched from state
> SUSPENDING to SUSPENDED.
> 2019-02-28 19:05:09,403 INFO
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator     - Stopping
> checkpoint coordinator for job 0e594c065c7f8319a12fa47e089ca9b0.
> 2019-02-28 19:05:09,403 INFO
> org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore  -
> Suspending
> 2019-02-28 19:05:09,448 INFO
> org.apache.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter  -
> Shutting down.
> 2019-02-28 19:05:09,448 INFO
> org.apache.flink.runtime.executiongraph.ExecutionGraph        - Job
> 0e594c065c7f8319a12fa47e089ca9b0 has been suspended.
> ...
> 2019-02-28 19:05:09,448 INFO
> org.apache.flink.runtime.jobmaster.JobMaster                  - Close
> ResourceManager connection 9db2027a0a32f2a44744a0d4a0f84b87: JobManager is
> no longer the leader..
> 2019-02-28 19:05:09,448 INFO
> org.apache.flink.runtime.jobmaster.slotpool.SlotPool          - Suspending
> SlotPool.
> 2019-02-28 19:05:09,448 INFO
> org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor          - The rpc
> endpoint org.apache.flink.runtime.jobmaster.JobMaster has not been started
> yet. Discarding message
> org.apache.flink.runtime.rpc.messages.RemoteFencedMessage until processing
> is started.
> 2019-02-28 19:05:09,448 INFO
> org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor          - The rpc
> endpoint org.apache.flink.runtime.jobmaster.JobMaster has not been started
> yet. Discarding message
> org.apache.flink.runtime.rpc.messages.RemoteFencedMessage until processing
> is started.
> 2019-02-28 19:05:09,448 INFO
> org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor          - The rpc
> endpoint org.apache.flink.runtime.jobmaster.JobMaster has not been started
> yet. Discarding message
> org.apache.flink.runtime.rpc.messages.RemoteFencedMessage until processing
> is started.
> 2019-02-28 19:05:09,448 INFO
> org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor          - The rpc
> endpoint org.apache.flink.runtime.jobmaster.JobMaster has not been started
> yet. Discarding message
> org.apache.flink.runtime.rpc.messages.RemoteFencedMessage until processing
> is started.
> 2019-02-28 19:05:09,448 INFO
> org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor          - The rpc
> endpoint org.apache.flink.runtime.jobmaster.JobMaster has not been started
> yet. Discarding message
> org.apache.flink.runtime.rpc.messages.RemoteFencedMessage until processing
> is started.
> 2019-02-28 19:05:09,448 INFO
> org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor          - The rpc
> endpoint org.apache.flink.runtime.jobmaster.JobMaster has not been started
> yet. Discarding message
> org.apache.flink.runtime.rpc.messages.RemoteFencedMessage until processing
> is started.
> 2019-02-28 19:05:09,448 INFO
> org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor          - The rpc
> endpoint org.apache.flink.runtime.jobmaster.JobMaster has not been started
> yet. Discarding message
> org.apache.flink.runtime.rpc.messages.RemoteFencedMessage until processing
> is started.
> 2019-02-28 19:05:09,448 INFO
> org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor          - The rpc
> endpoint org.apache.flink.runtime.jobmaster.JobMaster has not been started
> yet. Discarding message
> org.apache.flink.runtime.rpc.messages.RemoteFencedMessage until processing
> is started.
> 2019-02-28 19:05:09,448 INFO
> org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor          - The rpc
> endpoint org.apache.flink.runtime.jobmaster.JobMaster has not been started
> yet. Discarding message
> org.apache.flink.runtime.rpc.messages.RemoteFencedMessage until processing
> is started.
> 2019-02-28 19:05:09,448 INFO
> org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor          - The rpc
> endpoint org.apache.flink.runtime.jobmaster.JobMaster has not been started
> yet. Discarding message
> org.apache.flink.runtime.rpc.messages.RemoteFencedMessage until processing
> is started.
> 2019-02-28 19:05:09,448 INFO
> org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor          - The rpc
> endpoint org.apache.flink.runtime.jobmaster.JobMaster has not been started
> yet. Discarding message
> org.apache.flink.runtime.rpc.messages.RemoteFencedMessage until processing
> is started.
> 2019-02-28 19:05:09,448 INFO
> org.apache.flink.runtime.jobmaster.JobMaster                  - Stopping
> the JobMaster for job cybertron-flink(0e594c065c7f8319a12fa47e089ca9b0).
> 2019-02-28 19:05:09,449 INFO
> org.apache.flink.runtime.jobmaster.slotpool.SlotPool          - Stopping
> SlotPool.
> 2019-02-28 19:05:09,449 INFO
> org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  -
> Stopping ZooKeeperLeaderElectionService
> ZooKeeperLeaderElectionService{leaderPath='/leader/0e594c065c7f8319a12fa47e089ca9b0/job_manager_lock'}.
> 2019-02-28 19:05:09,452 INFO
> org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore  -
> Released locks of job graph 0e594c065c7f8319a12fa47e089ca9b0 from ZooKeeper.
>