Posted to issues@flink.apache.org by "Steven Zhen Wu (Jira)" <ji...@apache.org> on 2020/07/12 18:26:00 UTC

[jira] [Commented] (FLINK-11143) AskTimeoutException is thrown during job submission and completion

    [ https://issues.apache.org/jira/browse/FLINK-11143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17156363#comment-17156363 ] 

Steven Zhen Wu commented on FLINK-11143:
----------------------------------------

[~trohrmann] I am seeing a similar problem when trying unaligned checkpoints with 1.11.0. The Flink job actually started fine. We did not see this AskTimeoutException during job submission without unaligned checkpoints (on either 1.10 or 1.11).

Some more context about the app:
 * a large-state stream join app (a few TBs)
 * parallelism 1,440
 * number of containers: 180
 * Cores per container: 12
 * TM_TASK_SLOTS: 8
 * akka.ask.timeout: 120 s
 * heartbeat.timeout: 120000 ms
 * web.timeout: 60000 ms (also tried larger values such as 300,000 and 600,000 ms, without any difference)
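As a quick consistency check of the numbers above: 180 containers x 8 slots gives 1,440 slots, matching the parallelism, and heartbeat.timeout (120000 ms) equals akka.ask.timeout (120 s). The ask in the trace below, however, timed out after 60000 ms. The sketch restates that arithmetic in plain Java only; the class and method names are hypothetical, and it assumes heartbeat.timeout is given in milliseconds and that the job uses Flink's default slot sharing.

```java
import java.time.Duration;

/** Restates the cluster numbers from the comment; names are hypothetical. */
class ClusterSizing {

    /** Total task slots = TaskManager containers x slots per TaskManager. */
    static int totalSlots(int containers, int slotsPerTaskManager) {
        return Math.multiplyExact(containers, slotsPerTaskManager);
    }

    /**
     * Assuming default slot sharing, a job needs as many slots as its
     * highest operator parallelism.
     */
    static boolean fits(int parallelism, int availableSlots) {
        return parallelism <= availableSlots;
    }

    public static void main(String[] args) {
        int slots = totalSlots(180, 8); // 180 containers, TM_TASK_SLOTS = 8
        System.out.println("slots = " + slots + ", parallelism 1440 fits: " + fits(1_440, slots));

        // Normalize the configured timeouts (heartbeat.timeout assumed to be in ms).
        Duration akkaAskTimeout = Duration.ofSeconds(120);
        Duration heartbeatTimeout = Duration.ofMillis(120_000);
        // Timeout reported by the AskTimeoutException: "after [60000 ms]".
        Duration observedAskTimeout = Duration.ofMillis(60_000);

        System.out.println("heartbeat.timeout == akka.ask.timeout: "
                + heartbeatTimeout.equals(akkaAskTimeout));
        System.out.println("observed ask timeout == akka.ask.timeout: "
                + observedAskTimeout.equals(akkaAskTimeout));
    }
}
```

This only checks the numbers; it does not establish which setting produced the 60000 ms limit.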

I will send you the log files (at DEBUG level) offline by email. Thanks a lot in advance for your help!
{code:java}
"errors":["Internal server error.","<Exception on server side:
org.apache.flink.util.FlinkRuntimeException: Could not execute application.
	at org.apache.flink.client.deployment.application.DetachedApplicationRunner.tryExecuteJobs(DetachedApplicationRunner.java:81)
	at org.apache.flink.client.deployment.application.DetachedApplicationRunner.run(DetachedApplicationRunner.java:67)
	at org.apache.flink.runtime.webmonitor.handlers.JarRunHandler.lambda$handleRequest$0(JarRunHandler.java:99)
	at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1604)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.flink.client.program.ProgramInvocationException: The main method caused an error: Failed to execute job 'personalization-streaming-impressions-alt'.
	at org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:302)
	at org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:198)
	at org.apache.flink.client.ClientUtils.executeProgram(ClientUtils.java:149)
	at org.apache.flink.client.deployment.application.DetachedApplicationRunner.tryExecuteJobs(DetachedApplicationRunner.java:78)
	... 10 more
Caused by: org.apache.flink.util.FlinkException: Failed to execute job 'personalization-streaming-impressions-alt'.
	at org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.executeAsync(StreamExecutionEnvironment.java:1823)
	at org.apache.flink.client.program.StreamContextEnvironment.executeAsync(StreamContextEnvironment.java:128)
	at org.apache.flink.client.program.StreamContextEnvironment.execute(StreamContextEnvironment.java:76)
	at org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.execute(StreamExecutionEnvironment.java:1699)
	at com.netflix.spaas.application.SpaasBaseApplication.execute(SpaasBaseApplication.java:54)
	at com.netflix.dea.paa.streaming.impressions.ImpressionsJobMain$.main(ImpressionsJobMain.scala:12)
	at com.netflix.dea.paa.streaming.impressions.ImpressionsJobMain.main(ImpressionsJobMain.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:288)
	... 13 more
Caused by: java.util.concurrent.TimeoutException: Invocation of public abstract java.util.concurrent.CompletableFuture org.apache.flink.runtime.dispatcher.DispatcherGateway.submitJob(org.apache.flink.runtime.jobgraph.JobGraph,org.apache.flink.api.common.time.Time) timed out.
	at com.sun.proxy.$Proxy113.submitJob(Unknown Source)
	at org.apache.flink.client.deployment.application.executors.EmbeddedExecutor.lambda$submitJob$4(EmbeddedExecutor.java:158)
	at java.util.concurrent.CompletableFuture.uniComposeStage(CompletableFuture.java:995)
	at java.util.concurrent.CompletableFuture.thenCompose(CompletableFuture.java:2137)
	at org.apache.flink.client.deployment.application.executors.EmbeddedExecutor.submitJob(EmbeddedExecutor.java:158)
	at org.apache.flink.client.deployment.application.executors.EmbeddedExecutor.submitAndGetJobClientFuture(EmbeddedExecutor.java:119)
	at org.apache.flink.client.deployment.application.executors.EmbeddedExecutor.execute(EmbeddedExecutor.java:98)
	at org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.executeAsync(StreamExecutionEnvironment.java:1812)
	... 24 more
Caused by: akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka://flink/user/rpc/dispatcher_1#-283770831]] after [60000 ms]. Message of type [org.apache.flink.runtime.rpc.messages.LocalFencedMessage]. A typical reason for `AskTimeoutException` is that the recipient actor didn't send a reply.
	at akka.pattern.PromiseActorRef$$anonfun$2.apply(AskSupport.scala:635)
	at akka.pattern.PromiseActorRef$$anonfun$2.apply(AskSupport.scala:635)
	at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:648)
	at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:205)
	at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)
	at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)
	at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)
	at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:328)
	at akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:279)
	at akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:283)
	at akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:235)
	... 1 more
 {code}
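The last "Caused by" above notes that a typical reason for an AskTimeoutException is that the recipient actor did not reply in time. The plain-Java toy model below illustrates that mechanism; it is not Flink's or Akka's RPC code. A single-threaded executor stands in for an RPC endpoint's main thread (the class `AskTimeoutSketch` and its methods are hypothetical). If that thread is still busy when the ask's timeout fires, even a cheap request times out because it is queued behind the earlier work. It requires Java 9+ for `CompletableFuture.orTimeout`.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Toy model of an ask with a timeout; not Flink's or Akka's RPC implementation. */
class AskTimeoutSketch {

    static void sleep(long ms) {
        try {
            Thread.sleep(ms);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    /**
     * Occupies the endpoint's single thread for busyMs, then asks it for a cheap
     * reply with the given timeout. Returns true if the reply arrived in time.
     */
    static boolean askSucceeds(long busyMs, long timeoutMs) {
        // Stands in for an RPC endpoint that handles messages on one thread.
        ExecutorService endpoint = Executors.newSingleThreadExecutor();
        try {
            endpoint.execute(() -> sleep(busyMs)); // an earlier message still being handled
            CompletableFuture<String> reply = CompletableFuture
                    .supplyAsync(() -> "ack", endpoint) // queued behind the busy task
                    .orTimeout(timeoutMs, TimeUnit.MILLISECONDS); // Java 9+
            reply.get();
            return true;
        } catch (ExecutionException e) {
            if (e.getCause() instanceof TimeoutException) {
                return false; // the caller gave up before the endpoint replied
            }
            throw new IllegalStateException(e.getCause());
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for the reply", e);
        } finally {
            endpoint.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("idle endpoint, 2000 ms timeout: " + askSucceeds(0, 2_000));
        System.out.println("busy for 500 ms, 100 ms timeout: " + askSucceeds(500, 100));
    }
}
```

In the second call the reply itself is trivial, but it is only processed after the 100 ms timeout has already fired, so the caller sees a TimeoutException. Whether the dispatcher in this job was blocked in a similar way is not established by the trace.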
 

> AskTimeoutException is thrown during job submission and completion
> ------------------------------------------------------------------
>
>                 Key: FLINK-11143
>                 URL: https://issues.apache.org/jira/browse/FLINK-11143
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.6.2, 1.10.0
>            Reporter: Alex Vinnik
>            Priority: Critical
>         Attachments: flink-job-timeline.PNG
>
>
> For more details please see the thread
> [http://mail-archives.apache.org/mod_mbox/flink-user/201812.mbox/%3CC2FB26F9-1410-4333-80F4-34807481BCB6@gmail.com%3E]
> On submission 
> 2018-12-12 02:28:31 ERROR JobsOverviewHandler:92 - Implementation error: Unhandled exception.
>  akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka://flink/user/dispatcher#225683351]] after [10000 ms]. Sender[null] sent message of type "org.apache.flink.runtime.rpc.messages.LocalFencedMessage".
>  at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:604)
>  at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:126)
>  at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)
>  at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)
>  at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)
>  at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:329)
>  at akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:280)
>  at akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:284)
>  at akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:236)
>  at java.lang.Thread.run(Thread.java:748)
>  
> On completion
>  
> {"errors":["Internal server error.","<Exception on server side:
> java.util.concurrent.CompletionException: akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka://flink/user/dispatcher#105638574]] after [10000 ms]. Sender[null] sent message of type "org.apache.flink.runtime.rpc.messages.LocalFencedMessage".
> at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
> at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
> at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:593)
> at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:577)
> at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
> at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
> at org.apache.flink.runtime.concurrent.FutureUtils$1.onComplete(FutureUtils.java:772)
> at akka.dispatch.OnComplete.internal(Future.scala:258)
> at akka.dispatch.OnComplete.internal(Future.scala:256)
> at akka.dispatch.japi$CallbackBridge.apply(Future.scala:186)
> at akka.dispatch.japi$CallbackBridge.apply(Future.scala:183)
> at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:36)
> at org.apache.flink.runtime.concurrent.Executors$DirectExecutionContext.execute(Executors.java:83)
> at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:44)
> at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:252)
> at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:603)
> at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:126)
> at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)
> at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)
> at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)
> at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:329)
> at akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:280)
> at akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:284)
> at akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:236)
> at java.lang.Thread.run(Thread.java:748)
> Caused by: akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka://flink/user/dispatcher#105638574]] after [10000 ms]. Sender[null] sent message of type "org.apache.flink.runtime.rpc.messages.LocalFencedMessage".
> at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:604)
> ... 9 more
>
> End of exception on server side>"]}
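Both traces bury the timeout several levels deep (a CompletionException wrapping an AskTimeoutException on completion, and a chain of "Caused by" entries on submission). When such errors are caught in client code, a small helper that walks the cause chain can tell a timeout apart from other failures. The sketch below is a hypothetical plain-Java helper, not a Flink API; `java.util.concurrent.TimeoutException` stands in for `akka.pattern.AskTimeoutException` because Akka is not on the sketch's classpath.

```java
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.Optional;
import java.util.Set;
import java.util.concurrent.CompletionException;
import java.util.concurrent.TimeoutException;

/** Hypothetical helper for locating a timeout inside a nested "Caused by" chain. */
class CauseChain {

    /** Returns the first throwable in t's cause chain (t included) that is an instance of type. */
    static <T extends Throwable> Optional<T> findCause(Throwable t, Class<T> type) {
        // Identity set guards against cause chains that loop back on themselves.
        Set<Throwable> seen = Collections.newSetFromMap(new IdentityHashMap<>());
        for (Throwable cur = t; cur != null && seen.add(cur); cur = cur.getCause()) {
            if (type.isInstance(cur)) {
                return Optional.of(type.cast(cur));
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Same shape as the "On completion" trace: a CompletionException wrapping a timeout.
        Throwable error = new CompletionException(new TimeoutException(
                "Ask timed out on [Actor[akka://flink/user/dispatcher#105638574]] after [10000 ms]"));
        System.out.println(findCause(error, TimeoutException.class)
                .map(Throwable::getMessage)
                .orElse("no timeout in cause chain"));
    }
}
```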



--
This message was sent by Atlassian Jira
(v8.3.4#803005)