Posted to issues@flink.apache.org by "Till Rohrmann (JIRA)" <ji...@apache.org> on 2018/10/11 08:12:00 UTC

[jira] [Commented] (FLINK-10521) TaskManager metrics are not reported to prometheus after running a job

    [ https://issues.apache.org/jira/browse/FLINK-10521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16646104#comment-16646104 ] 

Till Rohrmann commented on FLINK-10521:
---------------------------------------

Can you add the debug logs [~florianschmidt]?

> TaskManager metrics are not reported to prometheus after running a job
> ----------------------------------------------------------------------
>
>                 Key: FLINK-10521
>                 URL: https://issues.apache.org/jira/browse/FLINK-10521
>             Project: Flink
>          Issue Type: Bug
>          Components: Metrics
>    Affects Versions: 1.6.1
>         Environment: Flink 1.6.1 cluster with one taskmanager and one jobmanager, prometheus and grafana, all started in a local docker environment.
> See sample project at: https://github.com/florianschmidt1994/flink-fault-tolerance-baseline
>            Reporter: Florian Schmidt
>            Priority: Major
>         Attachments: Screenshot 2018-10-10 at 11.32.59.png
>
>
> Update: This only seems to happen when my custom (admittedly poorly implemented) Histogram is enabled. Still, I think one poorly implemented metric should not bring down the whole metrics system.
> --
> I'm using prometheus to collect the metrics from Flink, and I noticed that shortly after running a job, metrics from the taskmanager will stop being reported most of the time.
> Looking at the prometheus logs I can see that requests to taskmanager:9249/metrics succeed when no job is running, but after a job starts those requests return an empty response with increasing frequency, until at some point most of the requests fail. I was able to verify this by running `curl localhost:9249/metrics` inside the taskmanager container, where more often than not the response was empty instead of containing the expected metrics.
> In the attached image you can see that occasionally some requests succeed, but there are some big gaps in between. Eventually the requests stop succeeding completely. The prometheus scrape interval is set to 1s.
> !Screenshot 2018-10-10 at 11.32.59.png!
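
The manual `curl` check described above could be repeated on a schedule to put a number on how often the endpoint answers with an empty body. The following is only a minimal sketch, not part of the original report: the helper names `fetch_metrics` and `count_empty` are hypothetical, and the URL and 1s interval are taken from the description. A failed or dropped request is counted as empty, which is an assumption about how the failure shows up on the wire.

```python
import http.client
import time
import urllib.request
from typing import Callable


def fetch_metrics(url: str, timeout: float = 2.0) -> str:
    """Return the response body, or "" if the request fails (assumption: failures count as empty)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.read().decode("utf-8", errors="replace")
    except (OSError, http.client.HTTPException):
        # URLError, timeouts and dropped connections are OSError subclasses.
        return ""


def count_empty(fetch: Callable[[], str], attempts: int, interval: float = 1.0) -> int:
    """Call fetch() `attempts` times, `interval` seconds apart, and count blank bodies."""
    empty = 0
    for i in range(attempts):
        if not fetch().strip():
            empty += 1
        if i < attempts - 1:
            time.sleep(interval)
    return empty
```

Run inside the taskmanager container, `count_empty(lambda: fetch_metrics("http://localhost:9249/metrics"), 30)` would report how many of 30 scrapes, one per second, came back empty. Comparing that count with and without the custom Histogram enabled might help confirm the suspicion in the update.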



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)