You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@flink.apache.org by Chenyu Zheng <ch...@hulu.com> on 2021/08/10 11:02:11 UTC

Flink 1.12.5: The heartbeat of JobManager/TaskManager with id xxx timed out

Hi there,

I’m trying to run my flink job on Kubernetes cluster, but when I try to give my job a larger parallelism (128) I get an error said “java.util.concurrent.TimeoutException: The heartbeat of JobManager with id 56ad1a5ded99f9f16ec1c786ad299159 timed out.” And then my job is cancelled.

We confirmed it cannot be a network issue, since:

  *   We encounter this error every time we run this job with larger parallelism (128), but it’s OK with smaller parallelism (32/64).
  *   We are using the k8s cluster in the production environment, and no other containers have the network problems.
  *   When we give “heartbeat.timeout” a larger value like 300s, the error never occurs again.

My settings and environment:

  *   Flink 1.12.5 with java8, scala 2.11
  *   Jobmanager Start command: $JAVA_HOME/bin/java -classpath $FLINK_CLASSPATH -Xmx15703474176 -Xms15703474176 -XX:MaxMetaspaceSize=268435456 -XX:+PrintGCApplicationStoppedTime -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintPromotionFailure -XX:+PrintGCCause -XX:+PrintHeapAtGC -XX:+PrintSafepointStatistics -XX:PrintSafepointStatisticsCount=1 -Dlog.file=/opt/flink/log/jobmanager.log -Dlog4j.configuration=file:/opt/flink/conf/log4j-console.properties -Dlog4j.configurationFile=file:/opt/flink/conf/log4j-console.properties org.apache.flink.kubernetes.entrypoint.KubernetesApplicationClusterEntrypoint -D jobmanager.memory.off-heap.size=134217728b -D jobmanager.memory.jvm-overhead.min=1073741824b -D jobmanager.memory.jvm-metaspace.size=268435456b -D jobmanager.memory.heap.size=15703474176b -D jobmanager.memory.jvm-overhead.max=1073741824b
  *   Taskmanager Start command: $JAVA_HOME/bin/java -classpath $FLINK_CLASSPATH -Xmx1664299798 -Xms1664299798 -XX:MaxDirectMemorySize=493921243 -XX:MaxMetaspaceSize=268435456 -XX:+PrintGCApplicationStoppedTime -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintPromotionFailure -XX:+PrintGCCause -XX:+PrintHeapAtGC -XX:+PrintSafepointStatistics -XX:PrintSafepointStatisticsCount=1 -Dlog.file=/opt/flink/log/taskmanager.log -Dlog4j.configuration=file:/opt/flink/conf/log4j-console.properties -Dlog4j.configurationFile=file:/opt/flink/conf/log4j-console.properties org.apache.flink.kubernetes.taskmanager.KubernetesTaskExecutorRunner -D taskmanager.memory.framework.off-heap.size=134217728b -D taskmanager.memory.network.max=359703515b -D taskmanager.memory.network.min=359703515b -D taskmanager.memory.framework.heap.size=134217728b -D taskmanager.memory.managed.size=1438814063b -D taskmanager.cpu.cores=1.0 -D taskmanager.memory.task.heap.size=1530082070b -D taskmanager.memory.task.off-heap.size=0b -D taskmanager.memory.jvm-metaspace.size=268435456b -D taskmanager.memory.jvm-overhead.max=429496736b -D taskmanager.memory.jvm-overhead.min=429496736b --configDir /opt/flink/conf -Djobmanager.rpc.address='10.50.132.154' -Dpipeline.classpaths='file:usrlib/flink-playground-clickcountjob-print.jar' -Djobmanager.memory.off-heap.size='134217728b' -Dweb.tmpdir='/tmp/flink-web-07190d10-c6ea-4b1a-9eee-b2d0b2711a76' -Drest.address='10.50.132.154' -Djobmanager.memory.jvm-overhead.max='1073741824b' -Djobmanager.memory.jvm-overhead.min='1073741824b' -Dtaskmanager.resource-id='stream-3111167f634e41349f7195961cdb0c6c-taskmanager-1-17' -Dexecution.target='embedded' -Dpipeline.jars='file:/opt/flink/usrlib/flink-playground-clickcountjob-print.jar' -Djobmanager.memory.jvm-metaspace.size='268435456b' -Djobmanager.memory.heap.size='15703474176b'

Is this an expected behavior? Could you give me some guideline about how to troubleshot this issue?

BRs

Chenyu

Re: Re: Flink 1.12.5: The heartbeat of JobManager/TaskManager with id xxx timed out

Posted by Yun Gao <yu...@aliyun.com>.
Hi Chenyu,

The tipically reasons for the heartbeat timeout includes:
1. Long GC time in TM / JM 
2. Network instability

Thus does the GC log or network monitor metrics could give
some hints ?

Best,
Yun


------------------------------------------------------------------
Sender:Chenyu Zheng<ch...@hulu.com>
Date:2021/08/10 19:07:00
Recipient:user@flink.apache.org<us...@flink.apache.org>
Theme:Re: Flink 1.12.5: The heartbeat of JobManager/TaskManager with id xxx timed out

      
JobManager timeout error:
2021-08-10 09:58:35,350 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Sink: Print to Std. Out (79/128) (b498a5b17c87eb70c3da9aea93890e25) switched from DEPLOYING to FAILED on stream-93072a8b402f49cca9c134a6e8b4887a-taskmanager-1-121 @ 10.50.151.120 (dataPort=46281).
java.util.concurrent.TimeoutException: The heartbeat of JobManager with id 56ad1a5ded99f9f16ec1c786ad299159 timed out.
            at org.apache.flink.runtime.taskexecutor.TaskExecutor$JobManagerHeartbeatListener.lambda$notifyHeartbeatTimeout$0(TaskExecutor.java:2260) ~[flink-dist_2.11-1.12.5.jar:1.12.5]
            at java.util.Optional.ifPresent(Optional.java:159) ~[?:1.8.0_302]
            at org.apache.flink.runtime.taskexecutor.TaskExecutor$JobManagerHeartbeatListener.notifyHeartbeatTimeout(TaskExecutor.java:2258) ~[flink-dist_2.11-1.12.5.jar:1.12.5]
            at org.apache.flink.runtime.heartbeat.HeartbeatMonitorImpl.run(HeartbeatMonitorImpl.java:111) ~[flink-dist_2.11-1.12.5.jar:1.12.5]
            at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[?:1.8.0_302]
            at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[?:1.8.0_302]
            at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:440) ~[flink-dist_2.11-1.12.5.jar:1.12.5]
            at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:208) ~[flink-dist_2.11-1.12.5.jar:1.12.5]
            at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:158) ~[flink-dist_2.11-1.12.5.jar:1.12.5]
            at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26) [flink-dist_2.11-1.12.5.jar:1.12.5]
            at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21) [flink-dist_2.11-1.12.5.jar:1.12.5]
            at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123) [flink-dist_2.11-1.12.5.jar:1.12.5]
            at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21) [flink-dist_2.11-1.12.5.jar:1.12.5]
            at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170) [flink-dist_2.11-1.12.5.jar:1.12.5]
            at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171) [flink-dist_2.11-1.12.5.jar:1.12.5]
            at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171) [flink-dist_2.11-1.12.5.jar:1.12.5]
            at akka.actor.Actor$class.aroundReceive(Actor.scala:517) [flink-dist_2.11-1.12.5.jar:1.12.5]
            at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225) [flink-dist_2.11-1.12.5.jar:1.12.5]
            at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592) [flink-dist_2.11-1.12.5.jar:1.12.5]
            at akka.actor.ActorCell.invoke(ActorCell.scala:561) [flink-dist_2.11-1.12.5.jar:1.12.5]
            at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258) [flink-dist_2.11-1.12.5.jar:1.12.5]
            at akka.dispatch.Mailbox.run(Mailbox.scala:225) [flink-dist_2.11-1.12.5.jar:1.12.5]
            at akka.dispatch.Mailbox.exec(Mailbox.scala:235) [flink-dist_2.11-1.12.5.jar:1.12.5]
            at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) [flink-dist_2.11-1.12.5.jar:1.12.5]
            at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) [flink-dist_2.11-1.12.5.jar:1.12.5]
            at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) [flink-dist_2.11-1.12.5.jar:1.12.5]
            at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) [flink-dist_2.11-1.12.5.jar:1.12.5]
2021-08-10 09:58:35,357 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Discarding the results produced by task execution b498a5b17c87eb70c3da9aea93890e25.
2021-08-10 09:58:35,362 INFO  org.apache.flink.runtime.executiongraph.failover.flip1.RestartPipelinedRegionFailoverStrategy [] - Calculating tasks to restart to recover the failed task 6d2677a0ecc3fd8df0b72ec675edf8f4_78.
2021-08-10 09:58:35,433 INFO  org.apache.flink.runtime.executiongraph.failover.flip1.RestartPipelinedRegionFailoverStrategy [] - 512 tasks should be restarted to recover the failed task 6d2677a0ecc3fd8df0b72ec675edf8f4_78. 
2021-08-10 09:58:35,437 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Job Click Event Count (00000000000000000000000000000000) switched from state RUNNING to FAILING.
org.apache.flink.runtime.JobException: Recovery is suppressed by NoRestartBackoffTimeStrategy
            at org.apache.flink.runtime.executiongraph.failover.flip1.ExecutionFailureHandler.handleFailure(ExecutionFailureHandler.java:118) ~[flink-dist_2.11-1.12.5.jar:1.12.5]
            at org.apache.flink.runtime.executiongraph.failover.flip1.ExecutionFailureHandler.getFailureHandlingResult(ExecutionFailureHandler.java:80) ~[flink-dist_2.11-1.12.5.jar:1.12.5]
            at org.apache.flink.runtime.scheduler.DefaultScheduler.handleTaskFailure(DefaultScheduler.java:233) ~[flink-dist_2.11-1.12.5.jar:1.12.5]
            at org.apache.flink.runtime.scheduler.DefaultScheduler.maybeHandleTaskFailure(DefaultScheduler.java:224) ~[flink-dist_2.11-1.12.5.jar:1.12.5]
            at org.apache.flink.runtime.scheduler.DefaultScheduler.updateTaskExecutionStateInternal(DefaultScheduler.java:215) ~[flink-dist_2.11-1.12.5.jar:1.12.5]
            at org.apache.flink.runtime.scheduler.SchedulerBase.updateTaskExecutionState(SchedulerBase.java:666) ~[flink-dist_2.11-1.12.5.jar:1.12.5]
            at org.apache.flink.runtime.scheduler.UpdateSchedulerNgOnInternalFailuresListener.notifyTaskFailure(UpdateSchedulerNgOnInternalFailuresListener.java:56) ~[flink-dist_2.11-1.12.5.jar:1.12.5]
            at org.apache.flink.runtime.executiongraph.ExecutionGraph.notifySchedulerNgAboutInternalTaskFailure(ExecutionGraph.java:1869) ~[flink-dist_2.11-1.12.5.jar:1.12.5]
            at org.apache.flink.runtime.executiongraph.Execution.processFail(Execution.java:1463) ~[flink-dist_2.11-1.12.5.jar:1.12.5]
            at org.apache.flink.runtime.executiongraph.Execution.processFail(Execution.java:1403) ~[flink-dist_2.11-1.12.5.jar:1.12.5]
            at org.apache.flink.runtime.executiongraph.Execution.fail(Execution.java:1081) ~[flink-dist_2.11-1.12.5.jar:1.12.5]
            at org.apache.flink.runtime.jobmaster.slotpool.SingleLogicalSlot.signalPayloadRelease(SingleLogicalSlot.java:213) ~[flink-dist_2.11-1.12.5.jar:1.12.5]
            at org.apache.flink.runtime.jobmaster.slotpool.SingleLogicalSlot.release(SingleLogicalSlot.java:200) ~[flink-dist_2.11-1.12.5.jar:1.12.5]
            at org.apache.flink.runtime.scheduler.SharedSlot.lambda$release$4(SharedSlot.java:272) ~[flink-dist_2.11-1.12.5.jar:1.12.5]
            at java.util.concurrent.CompletableFuture.uniAccept(CompletableFuture.java:670) ~[?:1.8.0_302]
            at java.util.concurrent.CompletableFuture.uniAcceptStage(CompletableFuture.java:683) ~[?:1.8.0_302]
            at java.util.concurrent.CompletableFuture.thenAccept(CompletableFuture.java:2010) ~[?:1.8.0_302]
            at org.apache.flink.runtime.scheduler.SharedSlot.release(SharedSlot.java:272) ~[flink-dist_2.11-1.12.5.jar:1.12.5]
            at org.apache.flink.runtime.jobmaster.slotpool.AllocatedSlot.releasePayload(AllocatedSlot.java:152) ~[flink-dist_2.11-1.12.5.jar:1.12.5]
            at org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl.releaseTaskManagerInternal(SlotPoolImpl.java:941) ~[flink-dist_2.11-1.12.5.jar:1.12.5]
            at org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl.releaseTaskManager(SlotPoolImpl.java:892) ~[flink-dist_2.11-1.12.5.jar:1.12.5]
            at org.apache.flink.runtime.jobmaster.JobMaster.disconnectTaskManager(JobMaster.java:508) ~[flink-dist_2.11-1.12.5.jar:1.12.5]
            at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_302]
            at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:1.8.0_302]
            at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_302]
            at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_302]
            at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcInvocation(AkkaRpcActor.java:305) ~[flink-dist_2.11-1.12.5.jar:1.12.5]
            at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:212) ~[flink-dist_2.11-1.12.5.jar:1.12.5]
            at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:77) ~[flink-dist_2.11-1.12.5.jar:1.12.5]
            at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:158) ~[flink-dist_2.11-1.12.5.jar:1.12.5]
            at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26) [flink-dist_2.11-1.12.5.jar:1.12.5]
            at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21) [flink-dist_2.11-1.12.5.jar:1.12.5]
            at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123) [flink-dist_2.11-1.12.5.jar:1.12.5]
            at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21) [flink-dist_2.11-1.12.5.jar:1.12.5]
            at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170) [flink-dist_2.11-1.12.5.jar:1.12.5]
            at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171) [flink-dist_2.11-1.12.5.jar:1.12.5]
            at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171) [flink-dist_2.11-1.12.5.jar:1.12.5]
            at akka.actor.Actor$class.aroundReceive(Actor.scala:517) [flink-dist_2.11-1.12.5.jar:1.12.5]
            at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225) [flink-dist_2.11-1.12.5.jar:1.12.5]
            at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592) [flink-dist_2.11-1.12.5.jar:1.12.5]
            at akka.actor.ActorCell.invoke(ActorCell.scala:561) [flink-dist_2.11-1.12.5.jar:1.12.5]
            at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258) [flink-dist_2.11-1.12.5.jar:1.12.5]
            at akka.dispatch.Mailbox.run(Mailbox.scala:225) [flink-dist_2.11-1.12.5.jar:1.12.5]
            at akka.dispatch.Mailbox.exec(Mailbox.scala:235) [flink-dist_2.11-1.12.5.jar:1.12.5]
            at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) [flink-dist_2.11-1.12.5.jar:1.12.5]
            at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) [flink-dist_2.11-1.12.5.jar:1.12.5]
            at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) [flink-dist_2.11-1.12.5.jar:1.12.5]
            at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) [flink-dist_2.11-1.12.5.jar:1.12.5]
Caused by: java.util.concurrent.TimeoutException: The heartbeat of JobManager with id 56ad1a5ded99f9f16ec1c786ad299159 timed out.
            at org.apache.flink.runtime.taskexecutor.TaskExecutor$JobManagerHeartbeatListener.lambda$notifyHeartbeatTimeout$0(TaskExecutor.java:2260) ~[flink-dist_2.11-1.12.5.jar:1.12.5]
            at java.util.Optional.ifPresent(Optional.java:159) ~[?:1.8.0_302]
            at org.apache.flink.runtime.taskexecutor.TaskExecutor$JobManagerHeartbeatListener.notifyHeartbeatTimeout(TaskExecutor.java:2258) ~[flink-dist_2.11-1.12.5.jar:1.12.5]
            at org.apache.flink.runtime.heartbeat.HeartbeatMonitorImpl.run(HeartbeatMonitorImpl.java:111) ~[flink-dist_2.11-1.12.5.jar:1.12.5]
            at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[?:1.8.0_302]
            at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[?:1.8.0_302]
            at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:440) ~[flink-dist_2.11-1.12.5.jar:1.12.5]
            at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:208) ~[flink-dist_2.11-1.12.5.jar:1.12.5]
            ... 19 more
 
 
TaskManager timeout error:
 
2021-08-10 10:45:36,845 WARN  org.apache.flink.runtime.taskmanager.Task                    [] - Source: ClickEvent Source (12/128)#0 (f0a1cc294d932e9525fba443f10aa1ce) switched from RUNNING to FAILED.
org.apache.flink.util.FlinkException: JobManager responsible for 00000000000000000000000000000000 lost the leadership.
            at org.apache.flink.runtime.taskexecutor.TaskExecutor.disconnectJobManagerConnection(TaskExecutor.java:1593) ~[flink-dist_2.11-1.12.5.jar:1.12.5]
            at org.apache.flink.runtime.taskexecutor.TaskExecutor.disconnectAndTryReconnectToJobManager(TaskExecutor.java:1163) ~[flink-dist_2.11-1.12.5.jar:1.12.5]
            at org.apache.flink.runtime.taskexecutor.TaskExecutor.access$3600(TaskExecutor.java:176) ~[flink-dist_2.11-1.12.5.jar:1.12.5]
            at org.apache.flink.runtime.taskexecutor.TaskExecutor$JobManagerHeartbeatListener.lambda$notifyHeartbeatTimeout$0(TaskExecutor.java:2260) ~[flink-dist_2.11-1.12.5.jar:1.12.5]
            at java.util.Optional.ifPresent(Optional.java:159) ~[?:1.8.0_302]
            at org.apache.flink.runtime.taskexecutor.TaskExecutor$JobManagerHeartbeatListener.notifyHeartbeatTimeout(TaskExecutor.java:2258) ~[flink-dist_2.11-1.12.5.jar:1.12.5]
            at org.apache.flink.runtime.heartbeat.HeartbeatMonitorImpl.run(HeartbeatMonitorImpl.java:111) ~[flink-dist_2.11-1.12.5.jar:1.12.5]
            at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[?:1.8.0_302]
            at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[?:1.8.0_302]
            at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:440) ~[flink-dist_2.11-1.12.5.jar:1.12.5]
            at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:208) ~[flink-dist_2.11-1.12.5.jar:1.12.5]
            at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:158) ~[flink-dist_2.11-1.12.5.jar:1.12.5]
            at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26) [flink-dist_2.11-1.12.5.jar:1.12.5]
            at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21) [flink-dist_2.11-1.12.5.jar:1.12.5]
            at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123) [flink-dist_2.11-1.12.5.jar:1.12.5]
            at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21) [flink-dist_2.11-1.12.5.jar:1.12.5]
            at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170) [flink-dist_2.11-1.12.5.jar:1.12.5]
            at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171) [flink-dist_2.11-1.12.5.jar:1.12.5]
            at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171) [flink-dist_2.11-1.12.5.jar:1.12.5]
            at akka.actor.Actor$class.aroundReceive(Actor.scala:517) [flink-dist_2.11-1.12.5.jar:1.12.5]
            at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225) [flink-dist_2.11-1.12.5.jar:1.12.5]
            at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592) [flink-dist_2.11-1.12.5.jar:1.12.5]
            at akka.actor.ActorCell.invoke(ActorCell.scala:561) [flink-dist_2.11-1.12.5.jar:1.12.5]
            at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258) [flink-dist_2.11-1.12.5.jar:1.12.5]
            at akka.dispatch.Mailbox.run(Mailbox.scala:225) [flink-dist_2.11-1.12.5.jar:1.12.5]
            at akka.dispatch.Mailbox.exec(Mailbox.scala:235) [flink-dist_2.11-1.12.5.jar:1.12.5]
            at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) [flink-dist_2.11-1.12.5.jar:1.12.5]
            at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) [flink-dist_2.11-1.12.5.jar:1.12.5]
            at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) [flink-dist_2.11-1.12.5.jar:1.12.5]
            at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) [flink-dist_2.11-1.12.5.jar:1.12.5]
Caused by: java.util.concurrent.TimeoutException: The heartbeat of JobManager with id 1da1bb0693814dd8cc2549e4f5cd368a timed out.
            ... 27 more
 
 
 
From: Chenyu Zheng <ch...@hulu.com>
Date: Tuesday, August 10, 2021 at 7:02 PM
To: "user@flink.apache.org" <us...@flink.apache.org>
Subject: Flink 1.12.5: The heartbeat of JobManager/TaskManager with id xxx timed out 
  
Hi there,
 
I’m trying to run my flink job on Kubernetes cluster, but when I try to give my job a larger parallelism (128) I get an error said “java.util.concurrent.TimeoutException: The heartbeat of JobManager with id 56ad1a5ded99f9f16ec1c786ad299159 timed out.” And then my job is cancelled. 
 
We confirmed it cannot be a network issue, since: 
We encounter this error every time we run this job with larger parallelism (128), but it’s OK with smaller parallelism (32/64).
We are using the k8s cluster in the production environment, and no other containers have the network problems.
When we give “heartbeat.timeout” a larger value like 300s, the error never occurs again.
 
My settings and environment: 
Flink 1.12.5 with java8, scala 2.11
Jobmanager Start command: $JAVA_HOME/bin/java -classpath $FLINK_CLASSPATH -Xmx15703474176 -Xms15703474176 -XX:MaxMetaspaceSize=268435456 -XX:+PrintGCApplicationStoppedTime -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintPromotionFailure -XX:+PrintGCCause -XX:+PrintHeapAtGC -XX:+PrintSafepointStatistics -XX:PrintSafepointStatisticsCount=1 -Dlog.file=/opt/flink/log/jobmanager.log -Dlog4j.configuration=file:/opt/flink/conf/log4j-console.properties -Dlog4j.configurationFile=file:/opt/flink/conf/log4j-console.properties org.apache.flink.kubernetes.entrypoint.KubernetesApplicationClusterEntrypoint -D jobmanager.memory.off-heap.size=134217728b -D jobmanager.memory.jvm-overhead.min=1073741824b -D jobmanager.memory.jvm-metaspace.size=268435456b -D jobmanager.memory.heap.size=15703474176b -D jobmanager.memory.jvm-overhead.max=1073741824b
Taskmanager Start command: $JAVA_HOME/bin/java -classpath $FLINK_CLASSPATH -Xmx1664299798 -Xms1664299798 -XX:MaxDirectMemorySize=493921243 -XX:MaxMetaspaceSize=268435456 -XX:+PrintGCApplicationStoppedTime -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintPromotionFailure -XX:+PrintGCCause -XX:+PrintHeapAtGC -XX:+PrintSafepointStatistics -XX:PrintSafepointStatisticsCount=1 -Dlog.file=/opt/flink/log/taskmanager.log -Dlog4j.configuration=file:/opt/flink/conf/log4j-console.properties -Dlog4j.configurationFile=file:/opt/flink/conf/log4j-console.properties org.apache.flink.kubernetes.taskmanager.KubernetesTaskExecutorRunner -D taskmanager.memory.framework.off-heap.size=134217728b -D taskmanager.memory.network.max=359703515b -D taskmanager.memory.network.min=359703515b -D taskmanager.memory.framework.heap.size=134217728b -D taskmanager.memory.managed.size=1438814063b -D taskmanager.cpu.cores=1.0 -D taskmanager.memory.task.heap.size=1530082070b -D taskmanager.memory.task.off-heap.size=0b -D taskmanager.memory.jvm-metaspace.size=268435456b -D taskmanager.memory.jvm-overhead.max=429496736b -D taskmanager.memory.jvm-overhead.min=429496736b --configDir /opt/flink/conf -Djobmanager.rpc.address='10.50.132.154' -Dpipeline.classpaths='file:usrlib/flink-playground-clickcountjob-print.jar' -Djobmanager.memory.off-heap.size='134217728b' -Dweb.tmpdir='/tmp/flink-web-07190d10-c6ea-4b1a-9eee-b2d0b2711a76' -Drest.address='10.50.132.154' -Djobmanager.memory.jvm-overhead.max='1073741824b' -Djobmanager.memory.jvm-overhead.min='1073741824b' -Dtaskmanager.resource-id='stream-3111167f634e41349f7195961cdb0c6c-taskmanager-1-17' -Dexecution.target='embedded' -Dpipeline.jars='file:/opt/flink/usrlib/flink-playground-clickcountjob-print.jar' -Djobmanager.memory.jvm-metaspace.size='268435456b' -Djobmanager.memory.heap.size='15703474176b'
 
Is this an expected behavior? Could you give me some guideline about how to troubleshot this issue? 
 
BRs
 
Chenyu    

Re: Flink 1.12.5: The heartbeat of JobManager/TaskManager with id xxx timed out

Posted by Chenyu Zheng <ch...@hulu.com>.
JobManager timeout error:
2021-08-10 09:58:35,350 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Sink: Print to Std. Out (79/128) (b498a5b17c87eb70c3da9aea93890e25) switched from DEPLOYING to FAILED on stream-93072a8b402f49cca9c134a6e8b4887a-taskmanager-1-121 @ 10.50.151.120 (dataPort=46281).
java.util.concurrent.TimeoutException: The heartbeat of JobManager with id 56ad1a5ded99f9f16ec1c786ad299159 timed out.
            at org.apache.flink.runtime.taskexecutor.TaskExecutor$JobManagerHeartbeatListener.lambda$notifyHeartbeatTimeout$0(TaskExecutor.java:2260) ~[flink-dist_2.11-1.12.5.jar:1.12.5]
            at java.util.Optional.ifPresent(Optional.java:159) ~[?:1.8.0_302]
            at org.apache.flink.runtime.taskexecutor.TaskExecutor$JobManagerHeartbeatListener.notifyHeartbeatTimeout(TaskExecutor.java:2258) ~[flink-dist_2.11-1.12.5.jar:1.12.5]
            at org.apache.flink.runtime.heartbeat.HeartbeatMonitorImpl.run(HeartbeatMonitorImpl.java:111) ~[flink-dist_2.11-1.12.5.jar:1.12.5]
            at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[?:1.8.0_302]
            at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[?:1.8.0_302]
            at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:440) ~[flink-dist_2.11-1.12.5.jar:1.12.5]
            at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:208) ~[flink-dist_2.11-1.12.5.jar:1.12.5]
            at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:158) ~[flink-dist_2.11-1.12.5.jar:1.12.5]
            at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26) [flink-dist_2.11-1.12.5.jar:1.12.5]
            at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21) [flink-dist_2.11-1.12.5.jar:1.12.5]
            at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123) [flink-dist_2.11-1.12.5.jar:1.12.5]
            at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21) [flink-dist_2.11-1.12.5.jar:1.12.5]
            at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170) [flink-dist_2.11-1.12.5.jar:1.12.5]
            at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171) [flink-dist_2.11-1.12.5.jar:1.12.5]
            at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171) [flink-dist_2.11-1.12.5.jar:1.12.5]
            at akka.actor.Actor$class.aroundReceive(Actor.scala:517) [flink-dist_2.11-1.12.5.jar:1.12.5]
            at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225) [flink-dist_2.11-1.12.5.jar:1.12.5]
            at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592) [flink-dist_2.11-1.12.5.jar:1.12.5]
            at akka.actor.ActorCell.invoke(ActorCell.scala:561) [flink-dist_2.11-1.12.5.jar:1.12.5]
            at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258) [flink-dist_2.11-1.12.5.jar:1.12.5]
            at akka.dispatch.Mailbox.run(Mailbox.scala:225) [flink-dist_2.11-1.12.5.jar:1.12.5]
            at akka.dispatch.Mailbox.exec(Mailbox.scala:235) [flink-dist_2.11-1.12.5.jar:1.12.5]
            at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) [flink-dist_2.11-1.12.5.jar:1.12.5]
            at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) [flink-dist_2.11-1.12.5.jar:1.12.5]
            at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) [flink-dist_2.11-1.12.5.jar:1.12.5]
            at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) [flink-dist_2.11-1.12.5.jar:1.12.5]
2021-08-10 09:58:35,357 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Discarding the results produced by task execution b498a5b17c87eb70c3da9aea93890e25.
2021-08-10 09:58:35,362 INFO  org.apache.flink.runtime.executiongraph.failover.flip1.RestartPipelinedRegionFailoverStrategy [] - Calculating tasks to restart to recover the failed task 6d2677a0ecc3fd8df0b72ec675edf8f4_78.
2021-08-10 09:58:35,433 INFO  org.apache.flink.runtime.executiongraph.failover.flip1.RestartPipelinedRegionFailoverStrategy [] - 512 tasks should be restarted to recover the failed task 6d2677a0ecc3fd8df0b72ec675edf8f4_78.
2021-08-10 09:58:35,437 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Job Click Event Count (00000000000000000000000000000000) switched from state RUNNING to FAILING.
org.apache.flink.runtime.JobException: Recovery is suppressed by NoRestartBackoffTimeStrategy
            at org.apache.flink.runtime.executiongraph.failover.flip1.ExecutionFailureHandler.handleFailure(ExecutionFailureHandler.java:118) ~[flink-dist_2.11-1.12.5.jar:1.12.5]
            at org.apache.flink.runtime.executiongraph.failover.flip1.ExecutionFailureHandler.getFailureHandlingResult(ExecutionFailureHandler.java:80) ~[flink-dist_2.11-1.12.5.jar:1.12.5]
            at org.apache.flink.runtime.scheduler.DefaultScheduler.handleTaskFailure(DefaultScheduler.java:233) ~[flink-dist_2.11-1.12.5.jar:1.12.5]
            at org.apache.flink.runtime.scheduler.DefaultScheduler.maybeHandleTaskFailure(DefaultScheduler.java:224) ~[flink-dist_2.11-1.12.5.jar:1.12.5]
            at org.apache.flink.runtime.scheduler.DefaultScheduler.updateTaskExecutionStateInternal(DefaultScheduler.java:215) ~[flink-dist_2.11-1.12.5.jar:1.12.5]
            at org.apache.flink.runtime.scheduler.SchedulerBase.updateTaskExecutionState(SchedulerBase.java:666) ~[flink-dist_2.11-1.12.5.jar:1.12.5]
            at org.apache.flink.runtime.scheduler.UpdateSchedulerNgOnInternalFailuresListener.notifyTaskFailure(UpdateSchedulerNgOnInternalFailuresListener.java:56) ~[flink-dist_2.11-1.12.5.jar:1.12.5]
            at org.apache.flink.runtime.executiongraph.ExecutionGraph.notifySchedulerNgAboutInternalTaskFailure(ExecutionGraph.java:1869) ~[flink-dist_2.11-1.12.5.jar:1.12.5]
            at org.apache.flink.runtime.executiongraph.Execution.processFail(Execution.java:1463) ~[flink-dist_2.11-1.12.5.jar:1.12.5]
            at org.apache.flink.runtime.executiongraph.Execution.processFail(Execution.java:1403) ~[flink-dist_2.11-1.12.5.jar:1.12.5]
            at org.apache.flink.runtime.executiongraph.Execution.fail(Execution.java:1081) ~[flink-dist_2.11-1.12.5.jar:1.12.5]
            at org.apache.flink.runtime.jobmaster.slotpool.SingleLogicalSlot.signalPayloadRelease(SingleLogicalSlot.java:213) ~[flink-dist_2.11-1.12.5.jar:1.12.5]
            at org.apache.flink.runtime.jobmaster.slotpool.SingleLogicalSlot.release(SingleLogicalSlot.java:200) ~[flink-dist_2.11-1.12.5.jar:1.12.5]
            at org.apache.flink.runtime.scheduler.SharedSlot.lambda$release$4(SharedSlot.java:272) ~[flink-dist_2.11-1.12.5.jar:1.12.5]
            at java.util.concurrent.CompletableFuture.uniAccept(CompletableFuture.java:670) ~[?:1.8.0_302]
            at java.util.concurrent.CompletableFuture.uniAcceptStage(CompletableFuture.java:683) ~[?:1.8.0_302]
            at java.util.concurrent.CompletableFuture.thenAccept(CompletableFuture.java:2010) ~[?:1.8.0_302]
            at org.apache.flink.runtime.scheduler.SharedSlot.release(SharedSlot.java:272) ~[flink-dist_2.11-1.12.5.jar:1.12.5]
            at org.apache.flink.runtime.jobmaster.slotpool.AllocatedSlot.releasePayload(AllocatedSlot.java:152) ~[flink-dist_2.11-1.12.5.jar:1.12.5]
            at org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl.releaseTaskManagerInternal(SlotPoolImpl.java:941) ~[flink-dist_2.11-1.12.5.jar:1.12.5]
            at org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl.releaseTaskManager(SlotPoolImpl.java:892) ~[flink-dist_2.11-1.12.5.jar:1.12.5]
            at org.apache.flink.runtime.jobmaster.JobMaster.disconnectTaskManager(JobMaster.java:508) ~[flink-dist_2.11-1.12.5.jar:1.12.5]
            at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_302]
            at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:1.8.0_302]
            at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_302]
            at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_302]
            at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcInvocation(AkkaRpcActor.java:305) ~[flink-dist_2.11-1.12.5.jar:1.12.5]
            at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:212) ~[flink-dist_2.11-1.12.5.jar:1.12.5]
            at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:77) ~[flink-dist_2.11-1.12.5.jar:1.12.5]
            at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:158) ~[flink-dist_2.11-1.12.5.jar:1.12.5]
            at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26) [flink-dist_2.11-1.12.5.jar:1.12.5]
            at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21) [flink-dist_2.11-1.12.5.jar:1.12.5]
            at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123) [flink-dist_2.11-1.12.5.jar:1.12.5]
            at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21) [flink-dist_2.11-1.12.5.jar:1.12.5]
            at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170) [flink-dist_2.11-1.12.5.jar:1.12.5]
            at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171) [flink-dist_2.11-1.12.5.jar:1.12.5]
            at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171) [flink-dist_2.11-1.12.5.jar:1.12.5]
            at akka.actor.Actor$class.aroundReceive(Actor.scala:517) [flink-dist_2.11-1.12.5.jar:1.12.5]
            at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225) [flink-dist_2.11-1.12.5.jar:1.12.5]
            at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592) [flink-dist_2.11-1.12.5.jar:1.12.5]
            at akka.actor.ActorCell.invoke(ActorCell.scala:561) [flink-dist_2.11-1.12.5.jar:1.12.5]
            at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258) [flink-dist_2.11-1.12.5.jar:1.12.5]
            at akka.dispatch.Mailbox.run(Mailbox.scala:225) [flink-dist_2.11-1.12.5.jar:1.12.5]
            at akka.dispatch.Mailbox.exec(Mailbox.scala:235) [flink-dist_2.11-1.12.5.jar:1.12.5]
            at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) [flink-dist_2.11-1.12.5.jar:1.12.5]
            at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) [flink-dist_2.11-1.12.5.jar:1.12.5]
            at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) [flink-dist_2.11-1.12.5.jar:1.12.5]
            at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) [flink-dist_2.11-1.12.5.jar:1.12.5]
Caused by: java.util.concurrent.TimeoutException: The heartbeat of JobManager with id 56ad1a5ded99f9f16ec1c786ad299159 timed out.
            at org.apache.flink.runtime.taskexecutor.TaskExecutor$JobManagerHeartbeatListener.lambda$notifyHeartbeatTimeout$0(TaskExecutor.java:2260) ~[flink-dist_2.11-1.12.5.jar:1.12.5]
            at java.util.Optional.ifPresent(Optional.java:159) ~[?:1.8.0_302]
            at org.apache.flink.runtime.taskexecutor.TaskExecutor$JobManagerHeartbeatListener.notifyHeartbeatTimeout(TaskExecutor.java:2258) ~[flink-dist_2.11-1.12.5.jar:1.12.5]
            at org.apache.flink.runtime.heartbeat.HeartbeatMonitorImpl.run(HeartbeatMonitorImpl.java:111) ~[flink-dist_2.11-1.12.5.jar:1.12.5]
            at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[?:1.8.0_302]
            at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[?:1.8.0_302]
            at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:440) ~[flink-dist_2.11-1.12.5.jar:1.12.5]
            at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:208) ~[flink-dist_2.11-1.12.5.jar:1.12.5]
            ... 19 more


TaskManager timeout error:

2021-08-10 10:45:36,845 WARN  org.apache.flink.runtime.taskmanager.Task                    [] - Source: ClickEvent Source (12/128)#0 (f0a1cc294d932e9525fba443f10aa1ce) switched from RUNNING to FAILED.
org.apache.flink.util.FlinkException: JobManager responsible for 00000000000000000000000000000000 lost the leadership.
            at org.apache.flink.runtime.taskexecutor.TaskExecutor.disconnectJobManagerConnection(TaskExecutor.java:1593) ~[flink-dist_2.11-1.12.5.jar:1.12.5]
            at org.apache.flink.runtime.taskexecutor.TaskExecutor.disconnectAndTryReconnectToJobManager(TaskExecutor.java:1163) ~[flink-dist_2.11-1.12.5.jar:1.12.5]
            at org.apache.flink.runtime.taskexecutor.TaskExecutor.access$3600(TaskExecutor.java:176) ~[flink-dist_2.11-1.12.5.jar:1.12.5]
            at org.apache.flink.runtime.taskexecutor.TaskExecutor$JobManagerHeartbeatListener.lambda$notifyHeartbeatTimeout$0(TaskExecutor.java:2260) ~[flink-dist_2.11-1.12.5.jar:1.12.5]
            at java.util.Optional.ifPresent(Optional.java:159) ~[?:1.8.0_302]
            at org.apache.flink.runtime.taskexecutor.TaskExecutor$JobManagerHeartbeatListener.notifyHeartbeatTimeout(TaskExecutor.java:2258) ~[flink-dist_2.11-1.12.5.jar:1.12.5]
            at org.apache.flink.runtime.heartbeat.HeartbeatMonitorImpl.run(HeartbeatMonitorImpl.java:111) ~[flink-dist_2.11-1.12.5.jar:1.12.5]
            at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[?:1.8.0_302]
            at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[?:1.8.0_302]
            at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:440) ~[flink-dist_2.11-1.12.5.jar:1.12.5]
            at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:208) ~[flink-dist_2.11-1.12.5.jar:1.12.5]
            at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:158) ~[flink-dist_2.11-1.12.5.jar:1.12.5]
            at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26) [flink-dist_2.11-1.12.5.jar:1.12.5]
            at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21) [flink-dist_2.11-1.12.5.jar:1.12.5]
            at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123) [flink-dist_2.11-1.12.5.jar:1.12.5]
            at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21) [flink-dist_2.11-1.12.5.jar:1.12.5]
            at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170) [flink-dist_2.11-1.12.5.jar:1.12.5]
            at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171) [flink-dist_2.11-1.12.5.jar:1.12.5]
            at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171) [flink-dist_2.11-1.12.5.jar:1.12.5]
            at akka.actor.Actor$class.aroundReceive(Actor.scala:517) [flink-dist_2.11-1.12.5.jar:1.12.5]
            at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225) [flink-dist_2.11-1.12.5.jar:1.12.5]
            at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592) [flink-dist_2.11-1.12.5.jar:1.12.5]
            at akka.actor.ActorCell.invoke(ActorCell.scala:561) [flink-dist_2.11-1.12.5.jar:1.12.5]
            at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258) [flink-dist_2.11-1.12.5.jar:1.12.5]
            at akka.dispatch.Mailbox.run(Mailbox.scala:225) [flink-dist_2.11-1.12.5.jar:1.12.5]
            at akka.dispatch.Mailbox.exec(Mailbox.scala:235) [flink-dist_2.11-1.12.5.jar:1.12.5]
            at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) [flink-dist_2.11-1.12.5.jar:1.12.5]
            at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) [flink-dist_2.11-1.12.5.jar:1.12.5]
            at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) [flink-dist_2.11-1.12.5.jar:1.12.5]
            at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) [flink-dist_2.11-1.12.5.jar:1.12.5]
Caused by: java.util.concurrent.TimeoutException: The heartbeat of JobManager with id 1da1bb0693814dd8cc2549e4f5cd368a timed out.
            ... 27 more



From: Chenyu Zheng <ch...@hulu.com>
Date: Tuesday, August 10, 2021 at 7:02 PM
To: "user@flink.apache.org" <us...@flink.apache.org>
Subject: Flink 1.12.5: The heartbeat of JobManager/TaskManager with id xxx timed out

Hi there,

I’m trying to run my flink job on Kubernetes cluster, but when I try to give my job a larger parallelism (128) I get an error said “java.util.concurrent.TimeoutException: The heartbeat of JobManager with id 56ad1a5ded99f9f16ec1c786ad299159 timed out.” And then my job is cancelled.

We confirmed it cannot be a network issue, since:

  *   We encounter this error every time we run this job with larger parallelism (128), but it’s OK with smaller parallelism (32/64).
  *   We are using the k8s cluster in the production environment, and no other containers have the network problems.
  *   When we give “heartbeat.timeout” a larger value like 300s, the error never occurs again.

My settings and environment:

  *   Flink 1.12.5 with java8, scala 2.11
  *   Jobmanager Start command: $JAVA_HOME/bin/java -classpath $FLINK_CLASSPATH -Xmx15703474176 -Xms15703474176 -XX:MaxMetaspaceSize=268435456 -XX:+PrintGCApplicationStoppedTime -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintPromotionFailure -XX:+PrintGCCause -XX:+PrintHeapAtGC -XX:+PrintSafepointStatistics -XX:PrintSafepointStatisticsCount=1 -Dlog.file=/opt/flink/log/jobmanager.log -Dlog4j.configuration=file:/opt/flink/conf/log4j-console.properties -Dlog4j.configurationFile=file:/opt/flink/conf/log4j-console.properties org.apache.flink.kubernetes.entrypoint.KubernetesApplicationClusterEntrypoint -D jobmanager.memory.off-heap.size=134217728b -D jobmanager.memory.jvm-overhead.min=1073741824b -D jobmanager.memory.jvm-metaspace.size=268435456b -D jobmanager.memory.heap.size=15703474176b -D jobmanager.memory.jvm-overhead.max=1073741824b
  *   Taskmanager Start command: $JAVA_HOME/bin/java -classpath $FLINK_CLASSPATH -Xmx1664299798 -Xms1664299798 -XX:MaxDirectMemorySize=493921243 -XX:MaxMetaspaceSize=268435456 -XX:+PrintGCApplicationStoppedTime -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintPromotionFailure -XX:+PrintGCCause -XX:+PrintHeapAtGC -XX:+PrintSafepointStatistics -XX:PrintSafepointStatisticsCount=1 -Dlog.file=/opt/flink/log/taskmanager.log -Dlog4j.configuration=file:/opt/flink/conf/log4j-console.properties -Dlog4j.configurationFile=file:/opt/flink/conf/log4j-console.properties org.apache.flink.kubernetes.taskmanager.KubernetesTaskExecutorRunner -D taskmanager.memory.framework.off-heap.size=134217728b -D taskmanager.memory.network.max=359703515b -D taskmanager.memory.network.min=359703515b -D taskmanager.memory.framework.heap.size=134217728b -D taskmanager.memory.managed.size=1438814063b -D taskmanager.cpu.cores=1.0 -D taskmanager.memory.task.heap.size=1530082070b -D taskmanager.memory.task.off-heap.size=0b -D taskmanager.memory.jvm-metaspace.size=268435456b -D taskmanager.memory.jvm-overhead.max=429496736b -D taskmanager.memory.jvm-overhead.min=429496736b --configDir /opt/flink/conf -Djobmanager.rpc.address='10.50.132.154' -Dpipeline.classpaths='file:usrlib/flink-playground-clickcountjob-print.jar' -Djobmanager.memory.off-heap.size='134217728b' -Dweb.tmpdir='/tmp/flink-web-07190d10-c6ea-4b1a-9eee-b2d0b2711a76' -Drest.address='10.50.132.154' -Djobmanager.memory.jvm-overhead.max='1073741824b' -Djobmanager.memory.jvm-overhead.min='1073741824b' -Dtaskmanager.resource-id='stream-3111167f634e41349f7195961cdb0c6c-taskmanager-1-17' -Dexecution.target='embedded' -Dpipeline.jars='file:/opt/flink/usrlib/flink-playground-clickcountjob-print.jar' -Djobmanager.memory.jvm-metaspace.size='268435456b' -Djobmanager.memory.heap.size='15703474176b'

Is this an expected behavior? Could you give me some guideline about how to troubleshot this issue?

BRs

Chenyu