Posted to dev@flink.apache.org by "Zhou Parker (Jira)" <ji...@apache.org> on 2021/10/29 07:06:00 UTC

[jira] [Created] (FLINK-24692) kubernetes session mode deployment failed since slot allocation timeout

Zhou Parker created FLINK-24692:
-----------------------------------

             Summary: kubernetes session mode deployment failed since slot allocation timeout
                 Key: FLINK-24692
                 URL: https://issues.apache.org/jira/browse/FLINK-24692
             Project: Flink
          Issue Type: Bug
          Components: Deployment / Kubernetes
    Affects Versions: 1.11.2
            Reporter: Zhou Parker


Kubernetes: 1.15
Flink: 1.11.2
 
When submitting the TopSpeedWindowing demo in session mode on Kubernetes, the job failed.

Log from JM:
 
Caused by: org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: Could not allocate the required slot within slot request timeout. Please make sure that the cluster has enough resources.
    at org.apache.flink.runtime.scheduler.DefaultScheduler.maybeWrapWithNoResourceAvailableException(DefaultScheduler.java:441) ~[flink-dist_2.11-1.11.2.jar:1.11.2]
    ... 45 more
Caused by: java.util.concurrent.CompletionException: java.util.concurrent.TimeoutException
    at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292) ~[?:1.8.0_275]
    at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308) ~[?:1.8.0_275]
    at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:607) ~[?:1.8.0_275]
    at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:591) ~[?:1.8.0_275]
    ... 25 more
Caused by: java.util.concurrent.TimeoutException
    ... 23 more
 

Log from TM:

 

2021-10-29 06:54:22,862 INFO org.apache.flink.runtime.rpc.akka.AkkaRpcService [] - Starting RPC endpoint for org.apache.flink.runtime.taskexecutor.TaskExecutor at akka://flink/user/rpc/taskmanager_0 .
2021-10-29 06:54:22,875 INFO org.apache.flink.runtime.taskexecutor.DefaultJobLeaderService [] - Start job leader service.
2021-10-29 06:54:22,877 INFO org.apache.flink.runtime.filecache.FileCache [] - User file cache uses directory /tmp/flink-dist-cache-7fb5ad02-77e1-4942-8ab6-3e10347664c4
2021-10-29 06:54:22,935 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Connecting to ResourceManager akka.tcp://flink@test.default:6123/user/rpc/resourcemanager_*(00000000000000000000000000000000).
2021-10-29 06:54:22,940 DEBUG org.apache.flink.runtime.rpc.akka.AkkaRpcService [] - Try to connect to remote RPC endpoint with address akka.tcp://flink@test.default:6123/user/rpc/resourcemanager_*. Returning a org.apache.flink.runtime.resourcemanager.ResourceManagerGateway gateway.
2021-10-29 06:54:23,265 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Resolved ResourceManager address, beginning registration
2021-10-29 06:54:23,265 DEBUG org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Registration at ResourceManager attempt 1 (timeout=100ms)
2021-10-29 06:54:23,391 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Successful registration at resource manager akka.tcp://flink@test.default:6123/user/rpc/resourcemanager_* under registration id dca9eaff5da556d2b99bd447a07538b7.
2021-10-29 06:54:23,456 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Receive slot request 190c5be552e5aed60834096b6e1efc2f for job f5680609a3e78061e63e97268e1860c6 from resource manager with leader id 00000000000000000000000000000000.
2021-10-29 06:54:23,462 DEBUG org.apache.flink.runtime.memory.MemoryManager [] - Initialized MemoryManager with total memory size 536870920 and page size 32768.
2021-10-29 06:54:23,464 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Allocated slot for 190c5be552e5aed60834096b6e1efc2f.
2021-10-29 06:54:23,465 INFO org.apache.flink.runtime.taskexecutor.DefaultJobLeaderService [] - Add job f5680609a3e78061e63e97268e1860c6 for job leader monitoring.
2021-10-29 06:54:23,466 DEBUG org.apache.flink.runtime.taskexecutor.DefaultJobLeaderService [] - New leader information for job f5680609a3e78061e63e97268e1860c6. Address: akka.tcp://flink@test.default:6123/user/rpc/jobmanager_2, leader id: 00000000000000000000000000000000.
2021-10-29 06:54:23,467 INFO org.apache.flink.runtime.taskexecutor.DefaultJobLeaderService [] - Try to register at job manager akka.tcp://flink@test.default:6123/user/rpc/jobmanager_2 with leader id 00000000-0000-0000-0000-000000000000.
2021-10-29 06:54:23,468 DEBUG org.apache.flink.runtime.rpc.akka.AkkaRpcService [] - Try to connect to remote RPC endpoint with address akka.tcp://flink@test.default:6123/user/rpc/jobmanager_2. Returning a org.apache.flink.runtime.jobmaster.JobMasterGateway gateway.
2021-10-29 06:54:23,541 INFO org.apache.flink.runtime.taskexecutor.DefaultJobLeaderService [] - Resolved JobManager address, beginning registration
2021-10-29 06:54:23,542 DEBUG org.apache.flink.runtime.taskexecutor.DefaultJobLeaderService [] - Registration at JobManager attempt 1 (timeout=100ms)
2021-10-29 06:54:23,660 DEBUG org.apache.flink.runtime.taskexecutor.DefaultJobLeaderService [] - Registration at JobManager (akka.tcp://flink@test.default:6123/user/rpc/jobmanager_2) attempt 1 timed out after 100 ms
2021-10-29 06:54:23,660 DEBUG org.apache.flink.runtime.taskexecutor.DefaultJobLeaderService [] - Registration at JobManager attempt 2 (timeout=200ms)
2021-10-29 06:54:23,878 DEBUG org.apache.flink.runtime.taskexecutor.DefaultJobLeaderService [] - Registration at JobManager (akka.tcp://flink@test.default:6123/user/rpc/jobmanager_2) attempt 2 timed out after 200 ms
2021-10-29 06:54:23,879 DEBUG org.apache.flink.runtime.taskexecutor.DefaultJobLeaderService [] - Registration at JobManager attempt 3 (timeout=400ms)
2021-10-29 06:54:24,299 DEBUG org.apache.flink.runtime.taskexecutor.DefaultJobLeaderService [] - Registration at JobManager (akka.tcp://flink@test.default:6123/user/rpc/jobmanager_2) attempt 3 timed out after 400 ms
2021-10-29 06:54:24,299 DEBUG org.apache.flink.runtime.taskexecutor.DefaultJobLeaderService [] - Registration at JobManager attempt 4 (timeout=800ms)
2021-10-29 06:54:25,118 DEBUG org.apache.flink.runtime.taskexecutor.DefaultJobLeaderService [] - Registration at JobManager (akka.tcp://flink@test.default:6123/user/rpc/jobmanager_2) attempt 4 timed out after 800 ms
2021-10-29 06:54:25,119 DEBUG org.apache.flink.runtime.taskexecutor.DefaultJobLeaderService [] - Registration at JobManager attempt 5 (timeout=1600ms)
2021-10-29 06:54:26,603 DEBUG org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Received heartbeat request from 8edb8ed60a1b18ffb9913e3d01670115.
2021-10-29 06:54:26,739 DEBUG org.apache.flink.runtime.taskexecutor.DefaultJobLeaderService [] - Registration at JobManager (akka.tcp://flink@test.default:6123/user/rpc/jobmanager_2) attempt 5 timed out after 1600 ms
2021-10-29 06:54:26,739 DEBUG org.apache.flink.runtime.taskexecutor.DefaultJobLeaderService [] - Registration at JobManager attempt 6 (timeout=3200ms)
2021-10-29 06:54:29,958 DEBUG org.apache.flink.runtime.taskexecutor.DefaultJobLeaderService [] - Registration at JobManager (akka.tcp://flink@test.default:6123/user/rpc/jobmanager_2) attempt 6 timed out after 3200 ms
2021-10-29 06:54:29,959 DEBUG org.apache.flink.runtime.taskexecutor.DefaultJobLeaderService [] - Registration at JobManager attempt 7 (timeout=6400ms)
2021-10-29 06:54:33,465 DEBUG org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Free slot with allocation id 190c5be552e5aed60834096b6e1efc2f because: The slot 190c5be552e5aed60834096b6e1efc2f has timed out.
2021-10-29 06:54:33,466 DEBUG org.apache.flink.runtime.taskexecutor.slot.TaskSlotTableImpl [] - Free slot TaskSlot(index:0, state:ALLOCATED, resource profile: ResourceProfile{cpuCores=1.0000000000000000, taskHeapMemory=384.000mb (402653174 bytes), taskOffHeapMemory=0 bytes, managedMemory=512.000mb (536870920 bytes), networkMemory=128.000mb (134217730 bytes)}, allocationId: 190c5be552e5aed60834096b6e1efc2f, jobId: f5680609a3e78061e63e97268e1860c6).
java.lang.Exception: The slot 190c5be552e5aed60834096b6e1efc2f has timed out.
 at org.apache.flink.runtime.taskexecutor.TaskExecutor.timeoutSlot(TaskExecutor.java:1653) ~[flink-dist_2.11-1.11.2.jar:1.11.2]
 at org.apache.flink.runtime.taskexecutor.TaskExecutor.access$2800(TaskExecutor.java:173) ~[flink-dist_2.11-1.11.2.jar:1.11.2]
 at org.apache.flink.runtime.taskexecutor.TaskExecutor$SlotActionsImpl.lambda$timeoutSlot$1(TaskExecutor.java:1940) ~[flink-dist_2.11-1.11.2.jar:1.11.2]
 at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:402) ~[flink-dist_2.11-1.11.2.jar:1.11.2]
 at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:195) ~[flink-dist_2.11-1.11.2.jar:1.11.2]
 at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:152) ~[flink-dist_2.11-1.11.2.jar:1.11.2]
 at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26) [flink-dist_2.11-1.11.2.jar:1.11.2]
 at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21) [flink-dist_2.11-1.11.2.jar:1.11.2]
 at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123) [flink-dist_2.11-1.11.2.jar:1.11.2]
 at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21) [flink-dist_2.11-1.11.2.jar:1.11.2]
 at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170) [flink-dist_2.11-1.11.2.jar:1.11.2]
 at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171) [flink-dist_2.11-1.11.2.jar:1.11.2]
 at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171) [flink-dist_2.11-1.11.2.jar:1.11.2]
 at akka.actor.Actor$class.aroundReceive(Actor.scala:517) [flink-dist_2.11-1.11.2.jar:1.11.2]
 at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225) [flink-dist_2.11-1.11.2.jar:1.11.2]
 at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592) [flink-dist_2.11-1.11.2.jar:1.11.2]
 at akka.actor.ActorCell.invoke(ActorCell.scala:561) [flink-dist_2.11-1.11.2.jar:1.11.2]
 at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258) [flink-dist_2.11-1.11.2.jar:1.11.2]
 at akka.dispatch.Mailbox.run(Mailbox.scala:225) [flink-dist_2.11-1.11.2.jar:1.11.2]
 at akka.dispatch.Mailbox.exec(Mailbox.scala:235) [flink-dist_2.11-1.11.2.jar:1.11.2]
 at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) [flink-dist_2.11-1.11.2.jar:1.11.2]
 at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) [flink-dist_2.11-1.11.2.jar:1.11.2]
 at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) [flink-dist_2.11-1.11.2.jar:1.11.2]
 at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) [flink-dist_2.11-1.11.2.jar:1.11.2]
2021-10-29 06:54:33,471 INFO org.apache.flink.runtime.taskexecutor.DefaultJobLeaderService [] - Remove job f5680609a3e78061e63e97268e1860c6 from job leader monitoring.
2021-10-29 06:54:33,471 DEBUG org.apache.flink.runtime.taskexecutor.DefaultJobLeaderService [] - Retrying registration towards akka.tcp://flink@test.default:6123/user/rpc/jobmanager_2 was cancelled.
2021-10-29 06:54:33,472 DEBUG org.apache.flink.runtime.state.TaskExecutorLocalStateStoresManager [] - Releasing local state under allocation id 190c5be552e5aed60834096b6e1efc2f.
2021-10-29 06:54:36,622 DEBUG org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Received heartbeat request from 8edb8ed60a1b18ffb9913e3d01670115.
2021-10-29 06:54:46,642 DEBUG org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Received heartbeat request from 8edb8ed60a1b18ffb9913e3d01670115.
2021-10-29 06:54:56,662 DEBUG org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Received heartbeat request from 8edb8ed60a1b18ffb9913e3d01670115.
2021-10-29 06:55:06,616 DEBUG org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Close ResourceManager connection 8edb8ed60a1b18ffb9913e3d01670115.
org.apache.flink.util.FlinkException: TaskExecutor exceeded the idle timeout.
 at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManagerImpl.releaseTaskExecutor(SlotManagerImpl.java:1258) ~[flink-dist_2.11-1.11.2.jar:1.11.2]
 at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManagerImpl.lambda$releaseTaskExecutorIfPossible$14(SlotManagerImpl.java:1251) ~[flink-dist_2.11-1.11.2.jar:1.11.2]
 at java.util.concurrent.CompletableFuture.uniAccept(CompletableFuture.java:670) ~[?:1.8.0_275]
 at java.util.concurrent.CompletableFuture$UniAccept.tryFire(CompletableFuture.java:646) ~[?:1.8.0_275]
 at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:456) ~[?:1.8.0_275]
 at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:402) ~[flink-dist_2.11-1.11.2.jar:1.11.2]
 at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:195) ~[flink-dist_2.11-1.11.2.jar:1.11.2]
 at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:74) ~[flink-dist_2.11-1.11.2.jar:1.11.2]
 at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:152) ~[flink-dist_2.11-1.11.2.jar:1.11.2]
 at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26) [flink-dist_2.11-1.11.2.jar:1.11.2]
 at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21) [flink-dist_2.11-1.11.2.jar:1.11.2]
 at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123) [flink-dist_2.11-1.11.2.jar:1.11.2]
 at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21) [flink-dist_2.11-1.11.2.jar:1.11.2]
 at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170) [flink-dist_2.11-1.11.2.jar:1.11.2]
 at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171) [flink-dist_2.11-1.11.2.jar:1.11.2]
 at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171) [flink-dist_2.11-1.11.2.jar:1.11.2]
 at akka.actor.Actor$class.aroundReceive(Actor.scala:517) [flink-dist_2.11-1.11.2.jar:1.11.2]
 at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225) [flink-dist_2.11-1.11.2.jar:1.11.2]
 at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592) [flink-dist_2.11-1.11.2.jar:1.11.2]
 at akka.actor.ActorCell.invoke(ActorCell.scala:561) [flink-dist_2.11-1.11.2.jar:1.11.2]
 at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258) [flink-dist_2.11-1.11.2.jar:1.11.2]
 at akka.dispatch.Mailbox.run(Mailbox.scala:225) [flink-dist_2.11-1.11.2.jar:1.11.2]
 at akka.dispatch.Mailbox.exec(Mailbox.scala:235) [flink-dist_2.11-1.11.2.jar:1.11.2]
 at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) [flink-dist_2.11-1.11.2.jar:1.11.2]
 at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) [flink-dist_2.11-1.11.2.jar:1.11.2]
 at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) [flink-dist_2.11-1.11.2.jar:1.11.2]
 at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) [flink-dist_2.11-1.11.2.jar:1.11.2]
2021-10-29 06:55:06,622 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Connecting to ResourceManager akka.tcp://flink@test.default:6123/user/rpc/resourcemanager_*(00000000000000000000000000000000).
2021-10-29 06:55:06,623 DEBUG org.apache.flink.runtime.rpc.akka.AkkaRpcService [] - Try to connect to remote RPC endpoint with address akka.tcp://flink@test.default:6123/user/rpc/resourcemanager_*. Returning a org.apache.flink.runtime.resourcemanager.ResourceManagerGateway gateway.
2021-10-29 06:55:06,631 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Resolved ResourceManager address, beginning registration
2021-10-29 06:55:06,631 DEBUG org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Registration at ResourceManager attempt 1 (timeout=100ms)
2021-10-29 06:55:06,636 INFO org.apache.flink.kubernetes.taskmanager.KubernetesTaskExecutorRunner [] - RECEIVED SIGNAL 15: SIGTERM. Shutting down as requested.
2021-10-29 06:55:06,638 INFO org.apache.flink.runtime.blob.TransientBlobCache [] - Shutting down BLOB cache
2021-10-29 06:55:06,639 DEBUG org.apache.flink.runtime.io.disk.iomanager.IOManager [] - Shutting down I/O manager.
2021-10-29 06:55:06,640 INFO org.apache.flink.runtime.filecache.FileCache [] - removed file cache directory /tmp/flink-dist-cache-7fb5ad02-77e1-4942-8ab6-3e10347664c4
2021-10-29 06:55:06,641 INFO org.apache.flink.runtime.blob.PermanentBlobCache [] - Shutting down BLOB cache
2021-10-29 06:55:06,643 INFO org.apache.flink.runtime.state.TaskExecutorLocalStateStoresManager [] - Shutting down TaskExecutorLocalStateStoresManager.
2021-10-29 06:55:06,645 INFO org.apache.flink.runtime.io.disk.FileChannelManagerImpl [] - FileChannelManager removed spill file directory /tmp/flink-io-66cad1f9-ce74-4c01-a02b-32d2e11dcb5a
2021-10-29 06:55:06,646 INFO org.apache.flink.runtime.io.disk.FileChannelManagerImpl [] - FileChannelManager removed spill file directory /tmp/flink-netty-shuffle-bbc6e6a4-9973-48a5-83b1-3ef94d8605f3
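
To make the timing in the TM log easier to follow, here is a rough sketch that replays the registration timeouts against the slot lifetime. The timestamps and the doubling timeouts (100ms up to 6400ms) are copied from the log lines above; the helper name backoff_timeouts is made up for illustration and is not a Flink API. The sketch ignores the small scheduling gaps between attempts and says nothing about why each registration attempt at the JobManager timed out in the first place.

```python
from datetime import datetime, timedelta

# Timestamps copied from the TaskManager log above.
FMT = "%Y-%m-%d %H:%M:%S,%f"
slot_allocated = datetime.strptime("2021-10-29 06:54:23,464", FMT)
slot_freed = datetime.strptime("2021-10-29 06:54:33,465", FMT)
first_attempt = datetime.strptime("2021-10-29 06:54:23,542", FMT)

ONE_MS = timedelta(milliseconds=1)


def backoff_timeouts(initial_ms, attempts):
    """Hypothetical helper: per-attempt timeouts that double each time,
    matching the values printed in the log (100, 200, ..., 6400)."""
    return [initial_ms * 2 ** i for i in range(attempts)]


timeouts = backoff_timeouts(100, 7)

# How long the slot stayed allocated before it was freed as timed out.
slot_lifetime_ms = (slot_freed - slot_allocated) // ONE_MS

# Time left for JobManager registration once the first attempt started.
budget_ms = (slot_freed - first_attempt) // ONE_MS

elapsed_ms = 0
first_exceeding = None
for attempt, timeout_ms in enumerate(timeouts, start=1):
    elapsed_ms += timeout_ms
    fits = elapsed_ms <= budget_ms
    if not fits and first_exceeding is None:
        first_exceeding = attempt
    print(f"attempt {attempt}: timeout={timeout_ms}ms, "
          f"cumulative={elapsed_ms}ms, fits_in_slot_lifetime={fits}")

print(f"slot lifetime: {slot_lifetime_ms}ms, registration budget: {budget_ms}ms")
print(f"first attempt that cannot finish before the slot is freed: {first_exceeding}")
```

Under these assumptions the slot was held for about 10 seconds, and attempt 7 was still pending when the slot was freed, which matches the "Retrying registration ... was cancelled" line in the log.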


