Posted to issues@flink.apache.org by "Roman Khachatryan (Jira)" <ji...@apache.org> on 2020/10/05 21:29:00 UTC

[jira] [Commented] (FLINK-19401) Job stuck in restart loop due to excessive checkpoint recoveries which block the JobMaster

    [ https://issues.apache.org/jira/browse/FLINK-19401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17208335#comment-17208335 ] 

Roman Khachatryan commented on FLINK-19401:
-------------------------------------------

I do see that the same data is downloaded over and over, but I'm not sure whether it's the root cause.
From the logs I see that the heartbeats of the TMs expire first *on the JM*, and the heartbeat of the JM expires *on the TMs* two minutes later:
{code:java}
flink-19401 $ grep -ah -B1 'Heartbeat of Task' *.log | grep '^2020' | cut -c-16 | sort | uniq -c
     66 2020-09-18 23:57
     11 2020-09-19 01:06
flink-19401 $ grep -ah -B1 'heartbeat of Job' *.log | grep '^2020' | cut -c-16 | sort | uniq -c
    509 2020-09-18 23:59
      2 2020-09-19 00:03
      2 2020-09-19 00:32
      2 2020-09-19 00:37
      2 2020-09-19 01:06
{code}
(77 vs 517 is due to the parallelism level, I think)
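For the record, the same two greps at seconds precision (cut -c-19 instead of cut -c-16) should make the ~2-minute gap explicit; this is an untested sketch with the same assumption about the log layout (timestamp on the line preceding the message, hence -B1):
{code:java}
flink-19401 $ grep -ah -B1 'Heartbeat of Task' *.log | grep '^2020' | cut -c-19 | sort | uniq -c | head
flink-19401 $ grep -ah -B1 'heartbeat of Job' *.log | grep '^2020' | cut -c-19 | sort | uniq -c | head
{code}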

 

Furthermore, the first 8 recoveries start after failures in Kafka: NPE in NetworkClient and "SSLProtocolException: Handshake message sequence violation, 2".

The subsequent recoveries happen after Kafka failures and the aforementioned TM heartbeat timeouts on the JM.
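To correlate those Kafka failures with the restarts in time, something along these lines should bucket them per minute (untested sketch in the same grep style; it assumes the Kafka exception lines also have the timestamp on the preceding line, so adjust the -B1/cut if they don't):
{code:java}
flink-19401 $ grep -ah -B1 -E 'NetworkClient|SSLProtocolException' *.log | grep '^2020' | cut -c-16 | sort | uniq -c
{code}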

 

So it looks like the root cause is a misconfiguration and/or a network issue.
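(As a purely hypothetical check, not taken from the attached logs: it may be worth confirming the heartbeat settings in use; the defaults are heartbeat.interval=10000 ms and heartbeat.timeout=50000 ms, and anything tighter makes the job much more sensitive to network hiccups or a busy JM.)
{code:java}
# hypothetical check on the JM/TM hosts; path assumes a standard Flink distribution layout
$ grep -E 'heartbeat\.(interval|timeout)' conf/flink-conf.yaml
{code}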

WDYT [~trohrmann], [~stevenz3wu] ?

 

Steven, can you please also explain what this means:
{code:java}
$ grep 72cc604b-dcd5-4c75-a466-0321d7c51c3e

2020-09-18 23:55:43,804 Titus task 72cc604b-dcd5-4c75-a466-0321d7c51c3e is being removed from TM termination pool after 60067 ms.
2020-09-18 23:56:11,256 Deploying Source: ee_clevent_presentation -> ee_clevent_presentationRaw -> cleventPresentation -> Flat Map (510/640) (attempt #1) to 1273fbc6edbab45548807742b2db6c4e @ 72cc604b-dcd5-4c75-a466-0321d7c51c3e (dataPort=39833)
2020-09-18 23:56:11,256 Deploying Source: k2-logtracevpc -> k2-logtracevpcRaw -> (Filter -> cybertronLogReplicated, Filter -> cybertronLogReplicatedCrossRegion, gpsRequestPivotAudit, gpsRequestPivotAuditCrossRegion) (510/640) (attempt #1) to 1273fbc6edbab45548807742b2db6c4e @ 72cc604b-dcd5-4c75-a466-0321d7c51c3e (dataPort=39833)
2020-09-18 23:56:11,261 Deploying Source: k2-ee_clevent -> k2-ee_cleventRaw -> (cleventAddToPlaylistCommand -> Flat Map, cleventThumbRating -> Flat Map) (510/640) (attempt #1) to 1273fbc6edbab45548807742b2db6c4e @ 72cc604b-dcd5-4c75-a466-0321d7c51c3e (dataPort=39833)
2020-09-18 23:56:11,261 Deploying Source: k2-defaultvpc -> k2-defaultvpcRaw -> (PssiPlaybackEvents -> Flat Map, Filter -> cybertronLogUnreplicated) (510/640) (attempt #1) to 1273fbc6edbab45548807742b2db6c4e @ 72cc604b-dcd5-4c75-a466-0321d7c51c3e (dataPort=39833)
2020-09-18 23:56:11,261 Deploying Filter -> Process -> (Sink: cybertron_joiner_client_input_0, Sink: cybertron_joiner_client_input_1) (510/640) (attempt #1) to 1273fbc6edbab45548807742b2db6c4e @ 72cc604b-dcd5-4c75-a466-0321d7c51c3e (dataPort=39833)
2020-09-18 23:56:11,261 Deploying Filter -> Process -> (Sink: cybertron_joiner_server_input_0, Sink: cybertron_joiner_server_input_1) (510/640) (attempt #1) to 1273fbc6edbab45548807742b2db6c4e @ 72cc604b-dcd5-4c75-a466-0321d7c51c3e (dataPort=39833)
2020-09-18 23:56:43,872 Adding Titus task 72cc604b-dcd5-4c75-a466-0321d7c51c3e to TM termination pool because TM is not registered.
2020-09-18 23:57:43,940 Titus task 72cc604b-dcd5-4c75-a466-0321d7c51c3e is being removed from TM termination pool after 60068 ms.
2020-09-18 23:58:44,008 Adding Titus task 72cc604b-dcd5-4c75-a466-0321d7c51c3e to TM termination pool because TM is not registered.
2020-09-18 23:59:11,794 Deploying Source: k2-defaultvpc -> k2-defaultvpcRaw -> (PssiPlaybackEvents -> Flat Map, Filter -> cybertronLogUnreplicated) (488/640) (attempt #3) to 1273fbc6edbab45548807742b2db6c4e @ 72cc604b-dcd5-4c75-a466-0321d7c51c3e (dataPort=39833)
2020-09-18 23:59:27,101 Deploying Source: k2-defaultvpc -> k2-defaultvpcRaw -> (PssiPlaybackEvents -> Flat Map, Filter -> cybertronLogUnreplicated) (478/640) (attempt #2) to 1273fbc6edbab45548807742b2db6c4e @ 72cc604b-dcd5-4c75-a466-0321d7c51c3e (dataPort=39833)
2020-09-18 23:59:44,075 Titus task 72cc604b-dcd5-4c75-a466-0321d7c51c3e is being removed from TM termination pool after 60066 ms.
2020-09-19 00:00:44,144 Adding Titus task 72cc604b-dcd5-4c75-a466-0321d7c51c3e to TM termination pool because TM is not registered.
2020-09-19 00:01:44,634 Titus task 72cc604b-dcd5-4c75-a466-0321d7c51c3e is being removed from TM termination pool after 60489 ms.
2020-09-19 00:02:44,710 Adding Titus task 72cc604b-dcd5-4c75-a466-0321d7c51c3e to TM termination pool because TM is not registered.
2020-09-19 00:03:44,799 Titus task 72cc604b-dcd5-4c75-a466-0321d7c51c3e is being removed from TM termination pool after 60088 ms.
2020-09-19 00:04:44,871 Adding Titus task 72cc604b-dcd5-4c75-a466-0321d7c51c3e to TM termination pool because TM is not registered.
2020-09-19 00:05:44,938 Titus task 72cc604b-dcd5-4c75-a466-0321d7c51c3e is being removed from TM termination pool after 60067 ms.
2020-09-19 00:06:45,023 Adding Titus task 72cc604b-dcd5-4c75-a466-0321d7c51c3e to TM termination pool because TM is not registered.
2020-09-19 00:07:45,427 Titus task 72cc604b-dcd5-4c75-a466-0321d7c51c3e is being removed from TM termination pool after 60404 ms.
2020-09-19 00:08:45,506 Adding Titus task 72cc604b-dcd5-4c75-a466-0321d7c51c3e to TM termination pool because TM is not registered.

{code}
Looks like Titus is adding and then removing the task every minute?
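If it helps to quantify that, the cadence should show up with something like this over the full logs (untested sketch, same grep style as above):
{code:java}
flink-19401 $ grep -ah '72cc604b-dcd5-4c75-a466-0321d7c51c3e' *.log | grep 'TM termination pool' | cut -c-16 | sort | uniq -c
{code}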

> Job stuck in restart loop due to excessive checkpoint recoveries which block the JobMaster
> ------------------------------------------------------------------------------------------
>
>                 Key: FLINK-19401
>                 URL: https://issues.apache.org/jira/browse/FLINK-19401
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Checkpointing
>    Affects Versions: 1.10.1, 1.11.2
>            Reporter: Steven Zhen Wu
>            Priority: Critical
>             Fix For: 1.12.0, 1.10.3, 1.11.3
>
>
> The Flink job sometimes got into a restart loop for many hours and couldn't recover until it was redeployed. We had an issue with Kafka that initially caused the job to restart.
> Below is the first of the many "ResourceManagerException: Could not find registered job manager" errors.
> {code}
> 2020-09-19 00:03:31,614 INFO  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [flink-akka.actor.default-dispatcher-35973]  - Requesting new slot [SlotRequestId{171f1df017dab3a42c032abd07908b9b}] and profile ResourceProfile{UNKNOWN} from resource manager.
> 2020-09-19 00:03:31,615 INFO  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [flink-akka.actor.default-dispatcher-35973]  - Requesting new slot [SlotRequestId{cc7d136c4ce1f32285edd4928e3ab2e2}] and profile ResourceProfile{UNKNOWN} from resource manager.
> 2020-09-19 00:03:31,615 INFO  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [flink-akka.actor.default-dispatcher-35973]  - Requesting new slot [SlotRequestId{024c8a48dafaf8f07c49dd4320d5cc94}] and profile ResourceProfile{UNKNOWN} from resource manager.
> 2020-09-19 00:03:31,615 INFO  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [flink-akka.actor.default-dispatcher-35973]  - Requesting new slot [SlotRequestId{a591eda805b3081ad2767f5641d0db06}] and profile ResourceProfile{UNKNOWN} from resource manager.
> 2020-09-19 00:03:31,620 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph       [flink-akka.actor.default-dispatcher-35973]  - Source: k2-csevpc -> k2-csevpcRaw -> (vhsPlaybackEvents -> Flat Map, merchImpressionsClientLog -> Flat Map) (56/640) (1b0d3dd1f19890886ff373a3f08809e8) switched from SCHEDULED to FAILED.
> java.util.concurrent.CompletionException: java.util.concurrent.CompletionException: org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: No pooled slot available and request to ResourceManager for new slot failed
>         at org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager$MultiTaskSlot.lambda$new$0(SlotSharingManager.java:433)
>         at java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:836)
>         at java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:811)
>         at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
>         at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990)
>         at org.apache.flink.runtime.concurrent.FutureUtils.lambda$forward$21(FutureUtils.java:1065)
>         at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774)
>         at java.util.concurrent.CompletableFuture.uniWhenCompleteStage(CompletableFuture.java:792)
>         at java.util.concurrent.CompletableFuture.whenComplete(CompletableFuture.java:2153)
>         at org.apache.flink.runtime.concurrent.FutureUtils.forward(FutureUtils.java:1063)
>         at org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager.createRootSlot(SlotSharingManager.java:155)
>         at org.apache.flink.runtime.jobmaster.slotpool.SchedulerImpl.allocateMultiTaskSlot(SchedulerImpl.java:511)
>         at org.apache.flink.runtime.jobmaster.slotpool.SchedulerImpl.allocateSharedSlot(SchedulerImpl.java:311)
>         at org.apache.flink.runtime.jobmaster.slotpool.SchedulerImpl.internalAllocateSlot(SchedulerImpl.java:160)
>         at org.apache.flink.runtime.jobmaster.slotpool.SchedulerImpl.allocateSlotInternal(SchedulerImpl.java:143)
>         at org.apache.flink.runtime.jobmaster.slotpool.SchedulerImpl.allocateSlot(SchedulerImpl.java:113)
>         at org.apache.flink.runtime.executiongraph.SlotProviderStrategy$NormalSlotProviderStrategy.allocateSlot(SlotProviderStrategy.java:115)
>         at org.apache.flink.runtime.scheduler.DefaultExecutionSlotAllocator.lambda$allocateSlotsFor$0(DefaultExecutionSlotAllocator.java:104)
>         at java.util.concurrent.CompletableFuture.uniComposeStage(CompletableFuture.java:995)
>         at java.util.concurrent.CompletableFuture.thenCompose(CompletableFuture.java:2137)
>         at org.apache.flink.runtime.scheduler.DefaultExecutionSlotAllocator.allocateSlotsFor(DefaultExecutionSlotAllocator.java:102)
>         at org.apache.flink.runtime.scheduler.DefaultScheduler.allocateSlots(DefaultScheduler.java:342)
>         at org.apache.flink.runtime.scheduler.DefaultScheduler.allocateSlotsAndDeploy(DefaultScheduler.java:311)
>         at org.apache.flink.runtime.scheduler.strategy.EagerSchedulingStrategy.allocateSlotsAndDeploy(EagerSchedulingStrategy.java:76)
>         at org.apache.flink.runtime.scheduler.strategy.EagerSchedulingStrategy.restartTasks(EagerSchedulingStrategy.java:57)
>         at org.apache.flink.runtime.scheduler.DefaultScheduler.lambda$restartTasks$1(DefaultScheduler.java:268)
>         at java.util.concurrent.CompletableFuture.uniRun(CompletableFuture.java:719)
>         at java.util.concurrent.CompletableFuture$UniRun.tryFire(CompletableFuture.java:701)
>         at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:456)
>         at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:402)
>         at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:195)
>         at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:74)
>         at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:152)
>         at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26)
>         at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21)
>         at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
>         at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21)
>         at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170)
>         at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
>         at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
>         at akka.actor.Actor$class.aroundReceive(Actor.scala:517)
>         at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225)
>         at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592)
>         at akka.actor.ActorCell.invoke(ActorCell.scala:561)
>         at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
>         at akka.dispatch.Mailbox.run(Mailbox.scala:225)
>         at akka.dispatch.Mailbox.exec(Mailbox.scala:235)
>         at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>         at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>         at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>         at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> Caused by: java.util.concurrent.CompletionException: org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: No pooled slot available and request to ResourceManager for new slot failed
>         at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
>         at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
>         at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:607)
>         at java.util.concurrent.CompletableFuture.uniApplyStage(CompletableFuture.java:628)
>         at java.util.concurrent.CompletableFuture.thenApply(CompletableFuture.java:1996)
>         at org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl.requestNewAllocatedSlot(SlotPoolImpl.java:438)
>         at org.apache.flink.runtime.jobmaster.slotpool.SchedulerImpl.requestNewAllocatedSlot(SchedulerImpl.java:236)
>         at org.apache.flink.runtime.jobmaster.slotpool.SchedulerImpl.allocateMultiTaskSlot(SchedulerImpl.java:506)
>         ... 39 more
> Caused by: org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: No pooled slot available and request to ResourceManager for new slot failed
>         at org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl.slotRequestToResourceManagerFailed(SlotPoolImpl.java:360)
>         at org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl.lambda$requestSlotFromResourceManager$1(SlotPoolImpl.java:348)
>         at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774)
>         at java.util.concurrent.CompletableFuture.uniWhenCompleteStage(CompletableFuture.java:792)
>         at java.util.concurrent.CompletableFuture.whenComplete(CompletableFuture.java:2153)
>         at org.apache.flink.runtime.concurrent.FutureUtils.whenCompleteAsyncIfNotDone(FutureUtils.java:941)
>         at org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl.requestSlotFromResourceManager(SlotPoolImpl.java:342)
>         at org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl.requestNewAllocatedSlotInternal(SlotPoolImpl.java:309)
>         at org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl.requestNewAllocatedSlot(SlotPoolImpl.java:437)
>         ... 41 more
> Caused by: java.util.concurrent.CompletionException: org.apache.flink.runtime.resourcemanager.exceptions.ResourceManagerException: Could not find registered job manager for job 70216adbeed914b35d77717c4b7b13ea.
>         at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
>         at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
>         at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:607)
>         at java.util.concurrent.CompletableFuture.uniApplyStage(CompletableFuture.java:628)
>         at java.util.concurrent.CompletableFuture.thenApply(CompletableFuture.java:1996)
>         at org.apache.flink.runtime.rpc.akka.AkkaInvocationHandler.invokeRpc(AkkaInvocationHandler.java:214)
>         at org.apache.flink.runtime.rpc.akka.AkkaInvocationHandler.invoke(AkkaInvocationHandler.java:129)
>         at org.apache.flink.runtime.rpc.akka.FencedAkkaInvocationHandler.invoke(FencedAkkaInvocationHandler.java:78)
>         at com.sun.proxy.$Proxy94.requestSlot(Unknown Source)
>         at org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl.requestSlotFromResourceManager(SlotPoolImpl.java:337)
>         ... 43 more
> Caused by: org.apache.flink.runtime.resourcemanager.exceptions.ResourceManagerException: Could not find registered job manager for job 70216adbeed914b35d77717c4b7b13ea.
>         at org.apache.flink.runtime.resourcemanager.ResourceManager.requestSlot(ResourceManager.java:443)
>         at sun.reflect.GeneratedMethodAccessor135.invoke(Unknown Source)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>         at java.lang.reflect.Method.invoke(Method.java:498)
>         at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcInvocation(AkkaRpcActor.java:284)
>         at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:199)
>         ... 20 more
> {code}
> Grepped through our logs: the job manager registration happened once, when the job was deployed a few days ago.
> {code}
> 2020-09-16 17:23:28,081 INFO  com.netflix.spaas.runtime.resourcemanager.TitusResourceManager [flink-akka.actor.default-dispatcher-60]  - Registered job manager a7263fdf0cd75d4d6481858f89894876@akka.tcp://flink@100.118.253.133:43917/user/jobmanager_0 for job 70216adbeed914b35d77717c4b7b13ea.
> {code}
> Then there was a flurry of 9 registrations in the same millisecond, ~30 mins after the first "Could not find registered job manager" error. The issue persisted for many hours after this. Because this is a pre-prod job, we didn't have an alert on it.
> {code}
> 2020-09-19 00:33:07,827 INFO  com.netflix.spaas.runtime.resourcemanager.TitusResourceManager [flink-akka.actor.default-dispatcher-35963]  - Registered job manager a7263fdf0cd75d4d6481858f89894876@akka.tcp://flink@100.118.253.133:43917/user/jobmanager_0 for job 70216adbeed914b35d77717c4b7b13ea.
> 2020-09-19 00:33:07,827 INFO  com.netflix.spaas.runtime.resourcemanager.TitusResourceManager [flink-akka.actor.default-dispatcher-35963]  - Registered job manager a7263fdf0cd75d4d6481858f89894876@akka.tcp://flink@100.118.253.133:43917/user/jobmanager_0 for job 70216adbeed914b35d77717c4b7b13ea.
> 2020-09-19 00:33:07,827 INFO  com.netflix.spaas.runtime.resourcemanager.TitusResourceManager [flink-akka.actor.default-dispatcher-35963]  - Registered job manager a7263fdf0cd75d4d6481858f89894876@akka.tcp://flink@100.118.253.133:43917/user/jobmanager_0 for job 70216adbeed914b35d77717c4b7b13ea.
> 2020-09-19 00:33:07,827 INFO  com.netflix.spaas.runtime.resourcemanager.TitusResourceManager [flink-akka.actor.default-dispatcher-35963]  - Registered job manager a7263fdf0cd75d4d6481858f89894876@akka.tcp://flink@100.118.253.133:43917/user/jobmanager_0 for job 70216adbeed914b35d77717c4b7b13ea.
> 2020-09-19 00:33:07,827 INFO  com.netflix.spaas.runtime.resourcemanager.TitusResourceManager [flink-akka.actor.default-dispatcher-35963]  - Registered job manager a7263fdf0cd75d4d6481858f89894876@akka.tcp://flink@100.118.253.133:43917/user/jobmanager_0 for job 70216adbeed914b35d77717c4b7b13ea.
> 2020-09-19 00:33:07,827 INFO  com.netflix.spaas.runtime.resourcemanager.TitusResourceManager [flink-akka.actor.default-dispatcher-35963]  - Registered job manager a7263fdf0cd75d4d6481858f89894876@akka.tcp://flink@100.118.253.133:43917/user/jobmanager_0 for job 70216adbeed914b35d77717c4b7b13ea.
> 2020-09-19 00:33:07,827 INFO  com.netflix.spaas.runtime.resourcemanager.TitusResourceManager [flink-akka.actor.default-dispatcher-35963]  - Registered job manager a7263fdf0cd75d4d6481858f89894876@akka.tcp://flink@100.118.253.133:43917/user/jobmanager_0 for job 70216adbeed914b35d77717c4b7b13ea.
> 2020-09-19 00:33:07,827 INFO  com.netflix.spaas.runtime.resourcemanager.TitusResourceManager [flink-akka.actor.default-dispatcher-35963]  - Registered job manager a7263fdf0cd75d4d6481858f89894876@akka.tcp://flink@100.118.253.133:43917/user/jobmanager_0 for job 70216adbeed914b35d77717c4b7b13ea.
> 2020-09-19 00:33:07,827 INFO  com.netflix.spaas.runtime.resourcemanager.TitusResourceManager [flink-akka.actor.default-dispatcher-35963]  - Registered job manager a7263fdf0cd75d4d6481858f89894876@akka.tcp://flink@100.118.253.133:43917/user/jobmanager_0 for job 70216adbeed914b35d77717c4b7b13ea.
> 2020-09-19 00:33:07,828 INFO  org.apache.flink.runtime.jobmaster.JobMaster                 [flink-akka.actor.default-dispatcher-35968]  - JobManager successfully registered at ResourceManager, leader id: bf239dac186bc8ba901a8702f4bb42e3.
> {code}
> I have the job manager logs for the hour with INFO level that I can share offline if needed.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)