Posted to issues@flink.apache.org by "Theo Diefenthal (Jira)" <ji...@apache.org> on 2019/12/02 11:08:00 UTC

[jira] [Commented] (FLINK-13418) Avoid InfluxdbReporter to report unnecessary tags

    [ https://issues.apache.org/jira/browse/FLINK-13418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16985965#comment-16985965 ] 

Theo Diefenthal commented on FLINK-13418:
-----------------------------------------

I think the main concern here is the bridge between Flink and InfluxDB, and the problem ultimately comes down to why we use metrics at all:

We usually use metrics to find and explain problems while or after they occur. It is thus especially important for a metrics system to remain stable when the application crashes or misbehaves.

Currently, if my job breaks for some reason and restarts very often, it will soon crash InfluxDB, as already explained above. We _could_ limit the number of restarts somehow, but I really want my jobs to keep trying to restart, because I usually expect that some partner system is down rather than that a failure in my application code is causing the continuous restarts.

So the other option is to make sure that InfluxDB's memory requirements do not grow indefinitely across restarts, which means that we need to keep the tag cardinality constant. (BTW, thanks [~yunta] for pointing me to tsi1, which reduced our problems a lot, but not completely.) In my case, with task names and ids properly assigned and Flink running on YARN, I observe the following problematic tags, i.e. tags with high cardinality that grows on restart/reschedule, ordered by cardinality, descending:
{code:java}
task_attempt_id
tm_id
job_id
task_attempt_num
{code}
For those tags, it would be great if we could disable them or store them as fields, ideally configurable. I know that storing them as fields would cause considerable storage overhead and lose the index, but we could compute the required storage capacity beforehand and plan our resources. Tags, in contrast, can explode unexpectedly on application crash without any resource limit, bounded only by how fast the application restarts.
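
To illustrate the idea, here is a minimal sketch of such a split. It is not the existing {{InfluxdbReporter}} code: the class {{TagFieldSplitter}}, its methods and the sample variable values are hypothetical, and I assume the variable keys arrive without Flink's angle brackets. It also simulates restarts to show that the number of distinct tag sets (a rough proxy for series per measurement) stays constant once the four variables above are no longer tags.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

/** Hypothetical helper, not part of Flink: splits metric variables into InfluxDB tags and fields. */
final class TagFieldSplitter {

    /** The variables listed above as growing on restart/reschedule (assumed key format). */
    static final Set<String> HIGH_CARDINALITY = new HashSet<>(Arrays.asList(
            "task_attempt_id", "tm_id", "job_id", "task_attempt_num"));

    /** Variables that stay indexed as tags. */
    static Map<String, String> tags(Map<String, String> variables) {
        return select(variables, false);
    }

    /** Variables demoted to unindexed fields. */
    static Map<String, String> fields(Map<String, String> variables) {
        return select(variables, true);
    }

    private static Map<String, String> select(Map<String, String> variables, boolean highCardinality) {
        Map<String, String> out = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : variables.entrySet()) {
            if (HIGH_CARDINALITY.contains(e.getKey()) == highCardinality) {
                out.put(e.getKey(), e.getValue());
            }
        }
        return out;
    }

    /** Simulates restarts of one subtask and counts the distinct tag sets that would be written. */
    static int distinctTagSets(int restarts, boolean demote) {
        Set<Map<String, String>> seen = new HashSet<>();
        for (int attempt = 0; attempt < restarts; attempt++) {
            Map<String, String> vars = new LinkedHashMap<>();
            vars.put("job_name", "my-job");          // made-up sample values
            vars.put("task_name", "my-task");
            vars.put("subtask_index", "0");
            vars.put("task_attempt_num", Integer.toString(attempt));
            vars.put("task_attempt_id", "attempt-" + attempt);
            seen.add(demote ? tags(vars) : vars);
        }
        return seen.size();
    }

    public static void main(String[] args) {
        System.out.println(distinctTagSets(1000, false)); // all variables as tags: prints 1000
        System.out.println(distinctTagSets(1000, true));  // high-cardinality ones as fields: prints 1
    }
}
```

With all variables as tags, the count grows by one per restart; with the split, it stays at one per subtask, and the per-attempt values only add field data whose volume can be estimated in advance.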

> Avoid InfluxdbReporter to report unnecessary tags
> -------------------------------------------------
>
>                 Key: FLINK-13418
>                 URL: https://issues.apache.org/jira/browse/FLINK-13418
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Metrics
>            Reporter: Yun Tang
>            Priority: Major
>             Fix For: 1.10.0
>
>
> Currently, when building measurement info, {{InfluxdbReporter}} includes all variables as tags (please see code [here|https://github.com/apache/flink/blob/d57741cef9d4773cc487418baa961254d0d47524/flink-metrics/flink-metrics-influxdb/src/main/java/org/apache/flink/metrics/influxdb/MeasurementInfoProvider.java#L54]). However, users can adjust their scope format to omit unnecessary scopes, yet {{InfluxdbReporter}} still reports all the scopes as tags to InfluxDB.
> This is because the current {{MetricGroup}} lacks any method to get only the necessary scopes; it offers only {{#getScopeComponents()}} and {{#getAllVariables()}}. In other words, InfluxDB needs a tag key and a tag value to compose each tag, while we can get only all variables (without any filtering according to the scope format) or only the scope components (which could be treated as tag values). I think that's why the previous implementation has to report all tags.
> From our experience with InfluxDB, since the tags contribute to the overall number of series in InfluxDB, it is never a good idea to include too many tags, not to mention that the [default limit of series per database|https://docs.influxdata.com/influxdb/v1.7/troubleshooting/errors/#error-max-series-per-database-exceeded] is only one million.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)