Posted to user@flink.apache.org by Filip Karnicki <fi...@gmail.com> on 2022/06/30 12:45:51 UTC

influxdb metrics reporter - 4k series per job restart

Hi All

We're using the InfluxDB reporter (Flink 1.14.3), which seems to create a
series per:
- [task|job]manager
- host
- job_id
- job_name
- subtask_index
- task_attempt_id
- task_attempt_num
- task_id
- tm_id

which amounts to about 4k new series each time our job restarts.

We are currently experiencing (unrelated) problems with checkpoint duration
timeouts (> 60s), so every 60 seconds our job restarts and creates a further
4k series in InfluxDB.

Needless to say, the team managing InfluxDB is not too happy with the
number of series we create.

Is there anything I can do to either reduce the number of series, or reduce
the number of types of metrics in order to produce fewer series? (we don't
view all the available metrics in grafana, so we don't necessarily have to
send all of them)

The db caps at 1M series, and with our current problems with checkpointing
we go through that many in a matter of hours

Many thanks
Fil
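
The "matter of hours" estimate can be checked with a little arithmetic. The sketch below uses only the figures quoted in the post (about 4k new series per restart, one restart every 60 seconds, a 1M series cap) and assumes, for illustration, that the database starts empty and that no old series are dropped in the meantime.

```python
# Figures taken from the post; the empty starting state is an assumption.
SERIES_PER_RESTART = 4_000   # "about 4k" new series per job restart
SERIES_CAP = 1_000_000       # series limit of the database
RESTART_INTERVAL_S = 60      # one restart per checkpoint timeout


def hours_until_cap(series_per_restart: int, cap: int, interval_s: int) -> float:
    """Hours until the cap is hit if every restart adds only new series."""
    restarts_needed = cap / series_per_restart
    return restarts_needed * interval_s / 3600


hours = hours_until_cap(SERIES_PER_RESTART, SERIES_CAP, RESTART_INTERVAL_S)
print(f"{hours:.1f} hours")  # prints "4.2 hours" (250 restarts)
```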

Re: influxdb metrics reporter - 4k series per job restart

Posted by Filip Karnicki <fi...@gmail.com>.
Hi All,

Thank you for your replies. What ended up working for me was setting

metrics.reporter.influxdb.scope.variables.excludes: job_id;task_attempt_num;tm_id;task_id;operator_id;task_attempt_id
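
Why this helps can be shown with a small model. The sketch below is not Flink or InfluxDB code: it treats a series as one distinct set of tags for a single metric and assumes that each restart gives every subtask a new task_attempt_num and task_attempt_id while the other tags stay the same. The tag values and the helper names (tags_for, count_series) are made up for illustration.

```python
# The variables excluded in the setting above.
EXCLUDES = frozenset({
    "job_id", "task_attempt_num", "tm_id",
    "task_id", "operator_id", "task_attempt_id",
})


def tags_for(subtask: int, attempt: int) -> dict:
    """Hypothetical tag set of one subtask during one execution attempt."""
    return {
        "host": "host-1",
        "job_name": "my-job",
        "job_id": "job-abc",
        "task_id": "task-xyz",
        "tm_id": "tm-1",
        "subtask_index": str(subtask),
        # Assumption: both attempt tags change on every restart.
        "task_attempt_num": str(attempt),
        "task_attempt_id": f"attempt-{subtask}-{attempt}",
    }


def count_series(subtasks: int, restarts: int, excludes=frozenset()) -> int:
    """Count distinct tag sets seen over the initial run plus all restarts."""
    seen = set()
    for attempt in range(restarts + 1):
        for subtask in range(subtasks):
            tags = tags_for(subtask, attempt)
            kept = {k: v for k, v in tags.items() if k not in excludes}
            seen.add(frozenset(kept.items()))
    return len(seen)


print(count_series(4, 10))            # 44: grows with every restart
print(count_series(4, 10, EXCLUDES))  # 4: one series per subtask
```

In this model the series count without excludes is subtasks x (restarts + 1), while with the excludes it stays at the number of subtasks no matter how often the job restarts.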


On Fri, 1 Jul 2022 at 18:36, Mason Chen <ma...@gmail.com> wrote:

> Hi all,
>
> If you can wait for Flink 1.16, there is a new feature to filter metrics
> (includes/excludes filter). Additionally, you can already take advantage of
> dropping unnecessary labels with `scope.variables.excludes` in the current
> release. Link to 1.16 metric features:
> https://nightlies.apache.org/flink/flink-docs-release-1.16/docs/deployment/metric_reporters/#reporter
>
> Best,
> Mason
>

Re: influxdb metrics reporter - 4k series per job restart

Posted by Mason Chen <ma...@gmail.com>.
Hi all,

If you can wait for Flink 1.16, there is a new feature to filter metrics
(includes/excludes filter). Additionally, you can already take advantage of
dropping unnecessary labels with `scope.variables.excludes` in the current
release. Link to 1.16 metric features:
https://nightlies.apache.org/flink/flink-docs-release-1.16/docs/deployment/metric_reporters/#reporter

Best,
Mason

On Fri, Jul 1, 2022 at 3:55 AM Martijn Visser <ma...@apache.org>
wrote:

> Have you considered fixing some of the scope variables to a constant
> value? For example, if you're not interested in the value of <task_id>,
> you could replace it with the fixed string "task_id" [1].
>
> Best regards,
>
> Martijn
>
> [1]
> https://nightlies.apache.org/flink/flink-docs-master/docs/ops/metrics/#system-scope
>

Re: influxdb metrics reporter - 4k series per job restart

Posted by Martijn Visser <ma...@apache.org>.
Have you considered fixing some of the scope variables to a constant
value? For example, if you're not interested in the value of <task_id>,
you could replace it with the fixed string "task_id" [1].

Best regards,

Martijn

[1]
https://nightlies.apache.org/flink/flink-docs-master/docs/ops/metrics/#system-scope

On Thu, 30 Jun 2022 at 15:52, Weihua Hu <hu...@gmail.com> wrote:

> Hi, Filip
>
> You can modify the InfluxdbReporter code to override the
> notifyOfAddedMetric method so that only the metrics you need are reported.
>
> Best,
> Weihua
>
>

Re: influxdb metrics reporter - 4k series per job restart

Posted by Weihua Hu <hu...@gmail.com>.
Hi, Filip

You can modify the InfluxdbReporter code to override the notifyOfAddedMetric
method so that only the metrics you need are reported.

Best,
Weihua
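
The idea of filtering at registration time can be sketched independently of Flink. The snippet below is a language-agnostic illustration in Python, not the InfluxdbReporter API: RecordingReporter is a hypothetical stand-in for the real reporter, the method name only mirrors notifyOfAddedMetric, and the allowlisted metric names are examples.

```python
class RecordingReporter:
    """Hypothetical stand-in for a reporter; it tracks registered metric names."""

    def __init__(self):
        self.registered = []

    def notify_of_added_metric(self, metric, name):
        self.registered.append(name)


class FilteringReporter(RecordingReporter):
    """Overrides registration so only allowlisted metrics are ever tracked."""

    # Example allowlist; replace with the metrics you actually chart.
    ALLOWED = frozenset({"numRecordsIn", "numRecordsOut"})

    def notify_of_added_metric(self, metric, name):
        if name in self.ALLOWED:
            super().notify_of_added_metric(metric, name)
        # Anything else is dropped before it can produce a series.


reporter = FilteringReporter()
for metric_name in ["numRecordsIn", "currentInputWatermark", "numRecordsOut"]:
    reporter.notify_of_added_metric(object(), metric_name)

print(reporter.registered)  # ['numRecordsIn', 'numRecordsOut']
```

Filtering at this point reduces the number of metric types sent, whereas excluding scope variables reduces the number of tag combinations per metric; the two approaches can be combined.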

