Posted to issues@flink.apache.org by "Chesnay Schepler (JIRA)" <ji...@apache.org> on 2017/05/14 20:24:04 UTC
[jira] [Comment Edited] (FLINK-6440) Noisy logs from metric fetcher

    [ https://issues.apache.org/jira/browse/FLINK-6440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16009838#comment-16009838 ] 

Chesnay Schepler edited comment on FLINK-6440 at 5/14/17 8:23 PM:
------------------------------------------------------------------

I'm wondering what our options are here. We can't just disable the logging: it is possible that only the {{MetricQueryService}} is unreachable, and that case should be logged.

We could limit the number of log messages in a given time frame, but then an unreachable MQS might only be logged after a long time.
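
A minimal sketch of such a limit, assuming a single shared limiter that the fetcher consults before logging. The class {{LogRateLimiter}} and its methods are hypothetical and not part of Flink:

```java
/** Hypothetical helper, not part of Flink: allows at most one log message per interval. */
class LogRateLimiter {
    private final long minIntervalMillis;
    private long lastLogMillis;
    private boolean loggedBefore;
    private int suppressed;

    LogRateLimiter(long minIntervalMillis) {
        this.minIntervalMillis = minIntervalMillis;
    }

    /** Returns true if the caller may log now; otherwise counts the message as suppressed. */
    synchronized boolean shouldLog(long nowMillis) {
        if (!loggedBefore || nowMillis - lastLogMillis >= minIntervalMillis) {
            loggedBefore = true;
            lastLogMillis = nowMillis;
            suppressed = 0;
            return true;
        }
        suppressed++;
        return false;
    }

    /** Number of messages suppressed since the last message that was allowed. */
    synchronized int suppressedCount() {
        return suppressed;
    }
}
```

Because the limiter is shared, a failure for one service can use up the slot for the current interval, so a failure for a different service is suppressed until a later interval. That is the delay described above.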

Finally, we could track the unreachable status of the MQS for each TaskManager, for example with a set that contains the paths. If a request fails, the path is added to the set, and we only log something when a path is newly added. Once a request succeeds, the path is removed again. The problem is that we would then need some time-based clean-up code, as the set could otherwise grow without bound in cases where many TMs are being replaced (and thus never become reachable again).
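
A rough sketch of this idea, under the assumption that a map from path to the time of the last failure stands in for the set, so that stale entries can be expired. The class {{UnreachableServiceTracker}}, its method names, and the expiry parameter are all hypothetical and not part of Flink:

```java
import java.util.Iterator;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical helper, not part of Flink: remembers which query services are unreachable. */
class UnreachableServiceTracker {
    // path of the query service -> timestamp (ms) of its most recent failed request
    private final Map<String, Long> unreachable = new ConcurrentHashMap<>();
    private final long expiryMillis;

    UnreachableServiceTracker(long expiryMillis) {
        this.expiryMillis = expiryMillis;
    }

    /** Records a failed request. Returns true only if the path was not marked yet, i.e. the caller should log. */
    boolean recordFailure(String path, long nowMillis) {
        return unreachable.put(path, nowMillis) == null;
    }

    /** Records a successful request. Returns true if the path had been marked unreachable. */
    boolean recordSuccess(String path) {
        return unreachable.remove(path) != null;
    }

    /** Time-based clean-up: drops entries whose last failure is older than expiryMillis. */
    int expire(long nowMillis) {
        int removed = 0;
        Iterator<Map.Entry<String, Long>> it = unreachable.entrySet().iterator();
        while (it.hasNext()) {
            if (nowMillis - it.next().getValue() > expiryMillis) {
                it.remove();
                removed++;
            }
        }
        return removed;
    }

    int size() {
        return unreachable.size();
    }
}
```

The fetcher would log a warning only when {{recordFailure}} returns true and would call {{expire}} periodically. An entry for a replaced TM disappears once no further requests are sent to it for longer than the expiry period.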

Sadly there isn't something like a {{TaskmanagerStatusListener}} interface; it would be useful for tracking and cleaning up state per {{TaskManager}}.


was (Author: zentol):
I'm wondering what our options are here. We can't just disable the logging; there is the possibility that only the {{MetricQueryService}} is unreachable and this should be logged if that's the case.

We could limit the # of log messages in a given time frame, but this would mean that an unreachable MQS may only be logged after a long long time.

Finally, we could track the unreachable status of the MQS; like a set that contains the paths. If a request fails it is added to the set, and we only log something when it is added to the set. Once a request succeeds it would be removed again. Problem is that we then would need some time-based clean-up code as the set could otherwise grow infinitely in cases where many TM's are being replaced (and thus are never reachable again).

Sadly there isn't something like a {{TaskmanagerStatusListener}} interface, this would be useful to track/clean-up state by {{TaskManager}}.

> Noisy logs from metric fetcher
> ------------------------------
>
>                 Key: FLINK-6440
>                 URL: https://issues.apache.org/jira/browse/FLINK-6440
>             Project: Flink
>          Issue Type: Bug
>          Components: Webfrontend
>    Affects Versions: 1.3.0
>            Reporter: Stephan Ewen
>            Priority: Critical
>             Fix For: 1.3.0
>
>
> In cases where TaskManagers fail, the web frontend in the Job Manager starts logging the exception below every few seconds.
> I labeled this as critical because the log gets flooded with noise, which makes debugging in such a situation complicated.
> {code}
> 2017-05-03 19:37:07,823 WARN  org.apache.flink.runtime.webmonitor.metrics.MetricFetcher     - Fetching metrics failed.
> akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka.tcp://flink@herman:52175/user/MetricQueryService_136f717a6b91e248282cb2937d22088c]] after [10000 ms]
>         at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:334)
>         at akka.actor.Scheduler$$anon$7.run(Scheduler.scala:117)
>         at scala.concurrent.Future$InternalCallbackExecutor$.scala$concurrent$Future$InternalCallbackExecutor$$unbatchedExecute(Future.scala:694)
>         at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:691)
>         at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(Scheduler.scala:474)
>         at akka.actor.LightArrayRevolverScheduler$$anon$8.executeBucket$1(Scheduler.scala:425)
>         at akka.actor.LightArrayRevolverScheduler$$anon$8.nextTick(Scheduler.scala:429)
>         at akka.actor.LightArrayRevolverScheduler$$anon$8.run(Scheduler.scala:381)
>         at java.lang.Thread.run(Thread.java:745)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)