Posted to user@flink.apache.org by Martijn Visser <ma...@apache.org> on 2022/07/01 10:54:50 UTC

Re: influxdb metrics reporter - 4k series per job restart

Have you considered setting some of the scope variables to a fixed
value? For example, if you're not interested in the value for <task_id>,
you could consider setting it to the fixed string "task_id" [1]?

Best regards,

Martijn

[1]
https://nightlies.apache.org/flink/flink-docs-master/docs/ops/metrics/#system-scope
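
For illustration, a hypothetical flink-conf.yaml line along those lines,
with the <task_id> variable replaced by the constant string "task_id" (the
rest of the format here is only an example; adapt it to your own scope
configuration):

metrics.scope.task: <host>.taskmanager.<tm_id>.<job_name>.task_id.<subtask_index>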

On Thu, 30 Jun 2022 at 15:52, Weihua Hu <hu...@gmail.com> wrote:

> Hi, Filip
>
> You can modify the InfluxdbReporter code to override the
> notifyOfAddedMetric method and report only the metrics you need.
>
> Best,
> Weihua
>
>
> On Thu, Jun 30, 2022 at 8:46 PM Filip Karnicki <fi...@gmail.com>
> wrote:
>
>> Hi All
>>
>> We're using the InfluxDB reporter (Flink 1.14.3), which seems to create a
>> separate series per:
>> - [task|job]manager
>> - host
>> - job_id
>> - job_name
>> - subtask_index
>> - task_attempt_id
>> - task_attempt_num
>> - task_id
>> - tm_id
>>
>> which amounts to about 4k series each time our job restarts.
>>
>> We are currently experiencing (unrelated) problems with checkpoint duration
>> timeouts (> 60s), and every 60 seconds our job restarts and creates a further
>> 4k series in InfluxDB.
>>
>> Needless to say, the team managing InfluxDB is not too happy with the
>> number of series we create.
>>
>> Is there anything I can do to either reduce the number of series, or
>> reduce the number of metric types in order to produce fewer series? (We
>> don't view all the available metrics in Grafana, so we don't necessarily
>> have to send all of them.)
>>
>> The database caps at 1M series, and with our current checkpointing problems
>> we go through that many in a matter of hours (at roughly one restart per
>> minute, 1M / 4k is about 250 restarts, i.e. around four hours).
>>
>> Many thanks
>> Fil
>>
>>

Re: influxdb metrics reporter - 4k series per job restart

Posted by Filip Karnicki <fi...@gmail.com>.
Hi All,

Thank you for your replies. What ended up working for me was setting

metrics.reporter.influxdb.scope.variables.excludes:
job_id;task_attempt_num;tm_id;task_id;operator_id;task_attempt_id
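
For context, that property sits alongside the usual reporter settings in
flink-conf.yaml; the scheme, host, port and database below are placeholders
rather than our real values:

metrics.reporter.influxdb.factory.class: org.apache.flink.metrics.influxdb.InfluxdbReporterFactory
metrics.reporter.influxdb.scheme: http
metrics.reporter.influxdb.host: localhost
metrics.reporter.influxdb.port: 8086
metrics.reporter.influxdb.db: flink
metrics.reporter.influxdb.scope.variables.excludes: job_id;task_attempt_num;tm_id;task_id;operator_id;task_attempt_id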


On Fri, 1 Jul 2022 at 18:36, Mason Chen <ma...@gmail.com> wrote:

> Hi all,
>
> If you can wait for Flink 1.16, there is a new feature to filter metrics
> (includes/excludes filter). Additionally, you can already drop unnecessary
> labels with `scope.variables.excludes` in the current
> release. Link to 1.16 metric features:
> https://nightlies.apache.org/flink/flink-docs-release-1.16/docs/deployment/metric_reporters/#reporter
>
> Best,
> Mason
>
> On Fri, Jul 1, 2022 at 3:55 AM Martijn Visser <ma...@apache.org>
> wrote:
>
>> Have you considered setting some of the scope variables to a fixed
>> value? For example, if you're not interested in the value for <task_id>,
>> you could consider setting it to the fixed string "task_id" [1]?
>>
>> Best regards,
>>
>> Martijn
>>
>> [1]
>> https://nightlies.apache.org/flink/flink-docs-master/docs/ops/metrics/#system-scope
>>
>> On Thu, 30 Jun 2022 at 15:52, Weihua Hu <hu...@gmail.com> wrote:
>>
>>> Hi, Filip
>>>
>>> You can modify the InfluxdbReporter code to override the
>>> notifyOfAddedMetric method and report only the metrics you need.
>>>
>>> Best,
>>> Weihua
>>>
>>>
>>> On Thu, Jun 30, 2022 at 8:46 PM Filip Karnicki <fi...@gmail.com>
>>> wrote:
>>>
>>>> Hi All
>>>>
>>>> We're using the InfluxDB reporter (Flink 1.14.3), which seems to create a
>>>> separate series per:
>>>> - [task|job]manager
>>>> - host
>>>> - job_id
>>>> - job_name
>>>> - subtask_index
>>>> - task_attempt_id
>>>> - task_attempt_num
>>>> - task_id
>>>> - tm_id
>>>>
>>>> which amounts to about 4k series each time our job restarts.
>>>>
>>>> We are currently experiencing (unrelated) problems with checkpoint
>>>> duration timeouts (> 60s), and every 60 seconds our job restarts and
>>>> creates a further 4k series in InfluxDB.
>>>>
>>>> Needless to say, the team managing InfluxDB is not too happy with the
>>>> number of series we create.
>>>>
>>>> Is there anything I can do to either reduce the number of series, or
>>>> reduce the number of metric types in order to produce fewer series? (We
>>>> don't view all the available metrics in Grafana, so we don't necessarily
>>>> have to send all of them.)
>>>>
>>>> The database caps at 1M series, and with our current checkpointing
>>>> problems we go through that many in a matter of hours (at roughly one
>>>> restart per minute, 1M / 4k is about 250 restarts, i.e. around four hours).
>>>>
>>>> Many thanks
>>>> Fil
>>>>
>>>>

Re: influxdb metrics reporter - 4k series per job restart

Posted by Mason Chen <ma...@gmail.com>.
Hi all,

If you can wait for Flink 1.16, there is a new feature to filter metrics
(includes/excludes filter). Additionally, you can already drop unnecessary
labels with `scope.variables.excludes` in the current
release. Link to 1.16 metric features:
https://nightlies.apache.org/flink/flink-docs-release-1.16/docs/deployment/metric_reporters/#reporter
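
For example, roughly along these lines (illustrative only; the reporter
name "influxdb" and the patterns are placeholders, and the exact filter
syntax is described on the linked page):

metrics.reporter.influxdb.filter.includes: *:numRecords*,numBytes*
metrics.reporter.influxdb.filter.excludes: *:*:histogram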

Best,
Mason

On Fri, Jul 1, 2022 at 3:55 AM Martijn Visser <ma...@apache.org>
wrote:

> Have you considered setting some of the scope variables to a fixed
> value? For example, if you're not interested in the value for <task_id>,
> you could consider setting it to the fixed string "task_id" [1]?
>
> Best regards,
>
> Martijn
>
> [1]
> https://nightlies.apache.org/flink/flink-docs-master/docs/ops/metrics/#system-scope
>
> On Thu, 30 Jun 2022 at 15:52, Weihua Hu <hu...@gmail.com> wrote:
>
>> Hi, Filip
>>
>> You can modify the InfluxdbReporter code to override the
>> notifyOfAddedMetric method and report only the metrics you need.
>>
>> Best,
>> Weihua
>>
>>
>> On Thu, Jun 30, 2022 at 8:46 PM Filip Karnicki <fi...@gmail.com>
>> wrote:
>>
>>> Hi All
>>>
>>> We're using the InfluxDB reporter (Flink 1.14.3), which seems to create a
>>> separate series per:
>>> - [task|job]manager
>>> - host
>>> - job_id
>>> - job_name
>>> - subtask_index
>>> - task_attempt_id
>>> - task_attempt_num
>>> - task_id
>>> - tm_id
>>>
>>> which amounts to about 4k series each time our job restarts.
>>>
>>> We are currently experiencing (unrelated) problems with checkpoint duration
>>> timeouts (> 60s), and every 60 seconds our job restarts and creates a further
>>> 4k series in InfluxDB.
>>>
>>> Needless to say, the team managing InfluxDB is not too happy with the
>>> number of series we create.
>>>
>>> Is there anything I can do to either reduce the number of series, or
>>> reduce the number of metric types in order to produce fewer series? (We
>>> don't view all the available metrics in Grafana, so we don't necessarily
>>> have to send all of them.)
>>>
>>> The database caps at 1M series, and with our current checkpointing problems
>>> we go through that many in a matter of hours (at roughly one restart per
>>> minute, 1M / 4k is about 250 restarts, i.e. around four hours).
>>>
>>> Many thanks
>>> Fil
>>>
>>>