Posted to issues@flink.apache.org by "Till Rohrmann (JIRA)" <ji...@apache.org> on 2018/03/29 14:26:00 UTC
[jira] [Commented] (FLINK-8624) flink-mesos: The flink rest-api sometimes becomes unresponsive

    [ https://issues.apache.org/jira/browse/FLINK-8624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16419081#comment-16419081 ] 

Till Rohrmann commented on FLINK-8624:
--------------------------------------

Hi [~bbayani], does this problem also exist on the Flink 1.5 release branch?

> flink-mesos: The flink rest-api sometimes becomes unresponsive
> --------------------------------------------------------------
>
>                 Key: FLINK-8624
>                 URL: https://issues.apache.org/jira/browse/FLINK-8624
>             Project: Flink
>          Issue Type: Bug
>          Components: Distributed Coordination, REST
>    Affects Versions: 1.3.2
>            Reporter: Bhumika Bayani
>            Priority: Blocker
>             Fix For: 1.5.0
>
>
> Sometimes flink-mesos-scheduler fails or gets killed, and Marathon brings it up again on some other node. We have observed that the REST API of the newly created Flink instance sometimes becomes unresponsive.
> Even if we execute API calls manually with curl, such as
> http://<host>:<port>/overview or http://<host>:<port>/config
> we do not receive any response.
> We submit and execute all our Flink jobs through the REST API only. So if the REST API becomes unresponsive, we cannot run any Flink jobs and no stream processing happens.
> We tried enabling Flink debug logs, but we did not observe anything specific that indicates why the REST API is failing or unresponsive.
> We see the exceptions below in the logs, but they are not specific to the case where the REST API is hung. We see them in a healthy flink-scheduler too:
>  
> {code:java}
> Timestamp=2018-02-08 05:43:49,175 LogLevel=INFO
>         ThreadId=[Checkpoint Timer] Class=o.a.f.r.c.CheckpointCoordinator Msg=Triggering checkpoint 10181 @ 1518068629174
> Timestamp=2018-02-08 05:43:49,183 LogLevel=DEBUG
>         ThreadId=[nioEventLoopGroup-5-3] Class=o.a.f.r.w.WebRuntimeMonitor Msg=Unhandled exception: {}
> akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka://flink/user/jobmanager#753807801]] after [10000 ms]
>         at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:334) ~[flink-dist_2.11-1.4-SNAPSHOT.jar:1.4-SNAPSHOT]
>         at akka.actor.Scheduler$$anon$7.run(Scheduler.scala:117) ~[flink-dist_2.11-1.4-SNAPSHOT.jar:1.4-SNAPSHOT]
>         at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601) ~[flink-dist_2.11-1.4-SNAPSHOT.jar:1.4-SNAPSHOT]
>         at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109) ~[flink-dist_2.11-1.4-SNAPSHOT.jar:1.4-SNAPSHOT]
>         at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599) ~[flink-dist_2.11-1.4-SNAPSHOT.jar:1.4-SNAPSHOT]
>         at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(Scheduler.scala:474) ~[flink-dist_2.11-1.4-SNAPSHOT.jar:1.4-SNAPSHOT]
>         at akka.actor.LightArrayRevolverScheduler$$anon$8.executeBucket$1(Scheduler.scala:425) ~[flink-dist_2.11-1.4-SNAPSHOT.jar:1.4-SNAPSHOT]
>         at akka.actor.LightArrayRevolverScheduler$$anon$8.nextTick(Scheduler.scala:429) ~[flink-dist_2.11-1.4-SNAPSHOT.jar:1.4-SNAPSHOT]
>         at akka.actor.LightArrayRevolverScheduler$$anon$8.run(Scheduler.scala:381) ~[flink-dist_2.11-1.4-SNAPSHOT.jar:1.4-SNAPSHOT]
>         at java.lang.Thread.run(Thread.java:745) [na:1.8.0_91]
> {code}
>  
> While the REST API is unresponsive, the Flink web UI also does not load or show any information.
> Restarting the flink-scheduler sometimes resolves the issue.
>  
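The manual curl check from the description could be scripted as a probe with an explicit timeout, so that a hung endpoint is reported as a timeout instead of blocking the caller. The following is only a minimal sketch in Python. It assumes nothing about Flink beyond the /overview and /config paths named in the report; the function name probe, the 5-second default, and the result labels are illustrative assumptions, not part of Flink.

```python
import socket
import urllib.error
import urllib.request


def probe(url, timeout_s=5.0):
    """Issue one GET and classify the outcome.

    Returns one of the (hypothetical) labels:
    "ok", "http-error", "timeout", "unreachable".
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            resp.read()  # force the body to be read within the timeout
            return "ok"
    except urllib.error.HTTPError:
        # The server answered, but with a 4xx/5xx status.
        return "http-error"
    except (socket.timeout, TimeoutError):
        # Connection was accepted but no response arrived in time.
        return "timeout"
    except urllib.error.URLError as exc:
        # A timeout during connect is wrapped in URLError.
        if isinstance(exc.reason, (socket.timeout, TimeoutError)):
            return "timeout"
        return "unreachable"
    except OSError:
        # Connection reset or similar low-level failure.
        return "unreachable"


def probe_all(base_url, paths=("/overview", "/config"), timeout_s=5.0):
    """Probe each path under base_url and return {path: label}."""
    return {p: probe(base_url + p, timeout_s) for p in paths}
```

Calling probe_all("http://<host>:<port>") with the real host and port substituted would be expected to report "timeout" for both paths in the hung state described above, assuming the server still accepts connections but never answers. A result of "unreachable" would instead suggest that nothing is listening on that port.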



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)