Posted to issues@flink.apache.org by "Chesnay Schepler (JIRA)" <ji...@apache.org> on 2019/01/30 09:12:00 UTC

[jira] [Commented] (FLINK-11457) PrometheusPushGatewayReporter either overwrites its own metrics or creates too many labels

    [ https://issues.apache.org/jira/browse/FLINK-11457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16755884#comment-16755884 ] 

Chesnay Schepler commented on FLINK-11457:
------------------------------------------

We've had a rather lengthy discussion about this very issue in the [PR|https://github.com/apache/flink/pull/5857] for FLINK-9187.

My conclusion was that we should use the ID of the Job-/TaskManager as the job name, since this is stable across the lifetime of a Flink process and technically doesn't introduce additional label values (it's just a duplicate of tm_id). However, this is blocked on adding IDs to the Dispatcher (which internally runs the JobManager), see FLINK-9543.

It is not possible to "just count up", since the different processes would have to coordinate in some way to agree on who gets which number.
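
To make the trade-off concrete, here is a minimal sketch of the three naming strategies. It is a toy model and not the reporter's actual code: FakePushGateway, the naming functions and the process IDs tm_a/tm_b/tm_c are all made up for illustration. The only behaviour it assumes is that a push replaces whatever was previously stored under the same job name.

```python
import uuid


class FakePushGateway:
    """Toy stand-in for a push gateway (hypothetical, not a real client API).

    It keeps one group of metrics per job name, and a push replaces
    whatever was stored under that job name before.
    """

    def __init__(self):
        self.groups = {}

    def push(self, job_name, process_id, metrics):
        # Last write wins for a given job name.
        self.groups[job_name] = {"process_id": process_id, "metrics": dict(metrics)}

    def label_values(self):
        # Distinct label values: every job name plus every process ID seen.
        values = set(self.groups)
        values.update(group["process_id"] for group in self.groups.values())
        return values


def fixed_name(base, process_id):
    # Every process pushes under the same job name.
    return base


def random_suffix_name(base, process_id):
    # Each process picks its own random suffix; no coordination needed.
    return f"{base}_{uuid.uuid4().hex}"


def process_id_name(base, process_id):
    # The job name is the process ID itself, so it duplicates an existing label value.
    return process_id


def simulate(naming, process_ids, base="myjob"):
    gateway = FakePushGateway()
    for pid in process_ids:
        gateway.push(naming(base, pid), pid, {"numRecordsIn": 1})
    return gateway


# Hypothetical process IDs, standing in for tm_id values.
process_ids = ["tm_a", "tm_b", "tm_c"]

fixed_gw = simulate(fixed_name, process_ids)
random_gw = simulate(random_suffix_name, process_ids)
id_gw = simulate(process_id_name, process_ids)

for name, gw in [("fixed", fixed_gw), ("random", random_gw), ("process id", id_gw)]:
    print(name, "groups:", len(gw.groups), "label values:", len(gw.label_values()))
```

With a fixed name only the last process's metrics survive. The random suffix keeps all of them but adds a fresh label value per process. Reusing the process ID keeps all of them without adding label values beyond the IDs that already exist, and none of the strategies except counting up requires the processes to talk to each other.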

> PrometheusPushGatewayReporter either overwrites its own metrics or creates too many labels
> ------------------------------------------------------------------------------------------
>
>                 Key: FLINK-11457
>                 URL: https://issues.apache.org/jira/browse/FLINK-11457
>             Project: Flink
>          Issue Type: Bug
>            Reporter: Oscar Westra van Holthe - Kind
>            Priority: Major
>
> When using the PrometheusPushGatewayReporter, one has two options:
>  * Use a fixed job name, which causes the jobmanager and taskmanager to overwrite each other's metrics (i.e. last write wins, and you lose a lot of metrics)
>  * Use a random suffix for the job name, which creates a lot of labels that have to be cleaned up manually
> The manual cleanup should not be necessary, but happens nonetheless when using a YARN cluster.
> A fix could be to add a suffix to the job name, naming the nodes in a non-random manner like: {{my_job_jm0}}, {{my_job_tm1}}, {{my_job_tm2}}, {{my_job_tm3}}, {{my_job_tm4}}, ..., using a counter (not sure if one is available), or some other stable (!) suffix.
> Related discussion: FLINK-9187
>  
> Any thoughts on a solution? I'm happy to implement it, but I'm not sure what the best solution would be.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)