Posted to user@flink.apache.org by Richard Moorhead <ri...@gmail.com> on 2020/02/13 01:40:24 UTC

UI stability at high parallelism

When I submit a job to a Flink session with parallelism higher than 128, the
job is submitted and renders in the UI, but when I view the job itself the
UI starts to rapidly emit errors in the upper right:

Server Response:
Unable to load requested file /bad-request.

Is this a known issue? Is there a fix? Does this indicate underlying
stability issues?

Re: UI stability at high parallelism

Posted by 张光辉 <be...@gmail.com>.
We also encountered a similar issue internally. cc +huweihua.ckl


Re: UI stability at high parallelism

Posted by Weihua Hu <hu...@gmail.com>.
These logs confirm that it is indeed a timeout issue. In our scenario, it was caused by task deployment taking a long time.
You can check whether the time a task takes to go from SCHEDULED to DEPLOYING in the log is greater than 10s. This step is processed in the main thread and blocks the processing of requests from the UI.
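
As a rough sketch of the check described above, the following script scans a jobmanager log for tasks whose SCHEDULED to DEPLOYING gap exceeds 10 seconds. It assumes state transitions are logged as "<timestamp> <level> <logger> - <task> switched from X to Y."; that line shape, the function name `slow_deployments`, and the log file path are assumptions for illustration, not something taken from this thread, so adjust the pattern to your actual log format.

```python
import re
import sys
from datetime import datetime

# Assumed log line shape (hypothetical, adjust to your logs):
# 2020-02-14 11:50:20,000 INFO  ...ExecutionGraph - Map (1/256) (a1) switched from CREATED to SCHEDULED.
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) .*? - "
    r"(?P<task>.+?) switched from (?P<src>[A-Z]+) to (?P<dst>[A-Z]+)\.?\s*$"
)


def parse_ts(text):
    """Parse a timestamp such as '2020-02-14 11:50:35,402'."""
    return datetime.strptime(text, "%Y-%m-%d %H:%M:%S,%f")


def slow_deployments(lines, threshold_s=10.0):
    """Return (task, seconds) pairs whose SCHEDULED -> DEPLOYING gap exceeds threshold_s."""
    scheduled_at = {}  # task name -> time it entered SCHEDULED
    slow = []
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue  # not a state-transition line
        ts, task = parse_ts(m["ts"]), m["task"]
        if m["dst"] == "SCHEDULED":
            scheduled_at[task] = ts
        elif m["src"] == "SCHEDULED" and m["dst"] == "DEPLOYING" and task in scheduled_at:
            gap = (ts - scheduled_at.pop(task)).total_seconds()
            if gap > threshold_s:
                slow.append((task, gap))
    return slow


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python check_deploy_gap.py /path/to/jobmanager.log
    with open(sys.argv[1], encoding="utf-8", errors="replace") as f:
        for task, gap in slow_deployments(f):
            print(f"{gap:8.1f}s  {task}")
```

Any task printed here spent longer than the threshold between the two states, which would be consistent with the main thread being busy during deployment.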

For now, you can increase 'akka.ask.timeout' to avoid this.
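
For illustration, the setting would go in flink-conf.yaml roughly as shown here. The value of 60 s is an arbitrary example rather than a recommendation from this thread; the "[10000 ms]" in the quoted log suggests the timeout in effect was 10 s.

```yaml
# flink-conf.yaml (example value only; pick one that covers your deployment time)
akka.ask.timeout: 60 s
```

The JobManager typically needs a restart to pick up the change.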

I have created a Jira issue to improve this: https://issues.apache.org/jira/browse/FLINK-16069

Best
Weihua Hu

> On Feb 15, 2020, at 01:54, Richard Moorhead <ri...@gmail.com> wrote:
> 
> 2020-02-14 11:50:35,402 ERROR org.apache.flink.runtime.rest.handler.job.JobsOverviewHandler  - Unhandled exception.
> akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka://flink/user/dispatcher#1293527273]] after [10000 ms]. Message of type [org.apache.flink.runtime.rpc.messages.LocalFencedMessage]. A typical reason for `AskTimeoutException` is that the recipient actor didn't send a reply.
> 	at akka.pattern.PromiseActorRef$.$anonfun$defaultOnTimeout$1(AskSupport.scala:635)
> 	at akka.pattern.PromiseActorRef$.$anonfun$apply$1(AskSupport.scala:650)
> 	at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:205)
> 	at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:870)
> 	at scala.concurrent.BatchingExecutor.execute(BatchingExecutor.scala:109)
> 	at scala.concurrent.BatchingExecutor.execute$(BatchingExecutor.scala:103)
> 	at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:868)
> 	at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:328)
> 	at akka.actor.LightArrayRevolverScheduler$$anon$3.executeBucket$1(LightArrayRevolverScheduler.scala:279)
> 	at akka.actor.LightArrayRevolverScheduler$$anon$3.nextTick(LightArrayRevolverScheduler.scala:283)
> 	at akka.actor.LightArrayRevolverScheduler$$anon$3.run(LightArrayRevolverScheduler.scala:235)
> 	at java.lang.Thread.run(Thread.java:748)


Re: UI stability at high parallelism

Posted by Richard Moorhead <ri...@gmail.com>.
2020-02-14 11:50:35,402 ERROR org.apache.flink.runtime.rest.handler.job.JobsOverviewHandler - Unhandled exception.
akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka://flink/user/dispatcher#1293527273]] after [10000 ms]. Message of type [org.apache.flink.runtime.rpc.messages.LocalFencedMessage]. A typical reason for `AskTimeoutException` is that the recipient actor didn't send a reply.
	at akka.pattern.PromiseActorRef$.$anonfun$defaultOnTimeout$1(AskSupport.scala:635)
	at akka.pattern.PromiseActorRef$.$anonfun$apply$1(AskSupport.scala:650)
	at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:205)
	at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:870)
	at scala.concurrent.BatchingExecutor.execute(BatchingExecutor.scala:109)
	at scala.concurrent.BatchingExecutor.execute$(BatchingExecutor.scala:103)
	at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:868)
	at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:328)
	at akka.actor.LightArrayRevolverScheduler$$anon$3.executeBucket$1(LightArrayRevolverScheduler.scala:279)
	at akka.actor.LightArrayRevolverScheduler$$anon$3.nextTick(LightArrayRevolverScheduler.scala:283)
	at akka.actor.LightArrayRevolverScheduler$$anon$3.run(LightArrayRevolverScheduler.scala:235)
	at java.lang.Thread.run(Thread.java:748)


Re: UI stability at high parallelism

Posted by HuWeihua <hu...@gmail.com>.
Hi, Richard

Most likely the REST API has timed out. You can try to find some evidence in the jobmanager log.

Please provide the full log to help us find the root cause.


Best
Weihua Hu
