Posted to user@flink.apache.org by Filip Karnicki <fi...@gmail.com> on 2022/06/30 12:45:51 UTC
influxdb metrics reporter - 4k series per job restart
Hi All
We're using the influx reporter (flink 1.14.3), which seems to create a
series per:
- [task|job]manager
- host
- job_id
- job_name
- subtask_index
- task_attempt_id
- task_attempt_num
- task_id
- tm_id
which amounts to about 4k series each time our job restarts itself.
We are currently experiencing problems with checkpoint duration timeouts (>
60s) (unrelated), and every 60 secs our job restarts and creates a further 4k
series in influxdb.
Needless to say, the team managing influxdb is not too happy with the
number of series we create.
Is there anything I can do to either reduce the number of series, or reduce
the number of types of metrics in order to produce fewer series? (we don't
view all the available metrics in grafana, so we don't necessarily have to
send all of them)
The db caps at 1M series, and with our current checkpointing problems
we go through that many in a matter of hours.
Many thanks
Fil
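
The figures in the message above can be checked with a little arithmetic. The sketch below is only an illustration: it assumes the database starts empty, that every restart adds a full set of about 4k brand-new series, and that restarts keep happening every 60 seconds. The function names are made up for this example.

```python
# Figures quoted in the message above.
SERIES_PER_RESTART = 4_000   # "about 4k series" created per restart
RESTART_INTERVAL_S = 60      # the job restarts roughly every 60 seconds
SERIES_CAP = 1_000_000       # the database caps at 1M series


def restarts_until_cap(cap: int, per_restart: int) -> int:
    """Number of restarts after which the cap is reached."""
    return cap // per_restart


def hours_until_cap(cap: int, per_restart: int, interval_s: int) -> float:
    """Wall-clock hours until the cap is reached at a constant restart rate."""
    return restarts_until_cap(cap, per_restart) * interval_s / 3600


if __name__ == "__main__":
    print(restarts_until_cap(SERIES_CAP, SERIES_PER_RESTART))
    print(round(hours_until_cap(SERIES_CAP, SERIES_PER_RESTART, RESTART_INTERVAL_S), 2))
```

Under these assumptions the cap is hit after 250 restarts, a little over four hours, which matches the "matter of hours" reported above.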
Re: influxdb metrics reporter - 4k series per job restart
Posted by Filip Karnicki <fi...@gmail.com>.
Hi All,
Thank you for your replies. What ended up working for me was setting
metrics.reporter.influxdb.scope.variables.excludes: job_id;task_attempt_num;tm_id;task_id;operator_id;task_attempt_id
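
Why dropping these variables helps can be shown with a toy model. The sketch below is not Flink or InfluxDB code. It assumes a series is identified by a measurement name plus its set of tags, and that task_attempt_id and task_attempt_num take new values on every restart while the other tags stay the same. The host, job and id values are invented for the illustration.

```python
from typing import Dict, FrozenSet, Set, Tuple

# Variables excluded by the setting quoted above.
EXCLUDED: FrozenSet[str] = frozenset(
    {"job_id", "task_attempt_num", "tm_id", "task_id", "operator_id", "task_attempt_id"}
)

SeriesKey = Tuple[str, FrozenSet[Tuple[str, str]]]


def series_key(measurement: str, tags: Dict[str, str],
               excluded: FrozenSet[str] = frozenset()) -> SeriesKey:
    """Identify a series by its measurement name plus the tags that are kept."""
    kept = frozenset((k, v) for k, v in tags.items() if k not in excluded)
    return measurement, kept


def count_series(restarts: int, subtasks: int,
                 excluded: FrozenSet[str] = frozenset()) -> int:
    """Count the distinct series one metric produces over several restarts."""
    seen: Set[SeriesKey] = set()
    for attempt in range(restarts):
        for subtask in range(subtasks):
            tags = {
                "host": "host-1",
                "job_name": "example-job",
                "job_id": "job-0001",
                "task_id": "task-0001",
                "tm_id": "tm-0001",
                "subtask_index": str(subtask),
                # Assumption: these two get a new value on every restart.
                "task_attempt_id": f"attempt-{attempt}-{subtask}",
                "task_attempt_num": str(attempt),
            }
            seen.add(series_key("numRecordsIn", tags, excluded))
    return len(seen)


if __name__ == "__main__":
    print(count_series(restarts=3, subtasks=4))            # grows with restarts
    print(count_series(restarts=3, subtasks=4, excluded=EXCLUDED))  # stays flat
```

In this model the series count grows with every restart when all tags are kept, and stays at one series per subtask once the per-attempt tags are excluded.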
On Fri, 1 Jul 2022 at 18:36, Mason Chen <ma...@gmail.com> wrote:
> Hi all,
>
> If you can wait for Flink 1.16, there is a new feature to filter metrics
> (includes/excludes filter). Additionally, you can already take advantage of
> dropping unnecessary labels with `scope.variables.excludes` in the current
> release. Link to 1.16 metric features:
> https://nightlies.apache.org/flink/flink-docs-release-1.16/docs/deployment/metric_reporters/#reporter
>
> Best,
> Mason
Re: influxdb metrics reporter - 4k series per job restart
Posted by Mason Chen <ma...@gmail.com>.
Hi all,
If you can wait for Flink 1.16, there is a new feature to filter metrics
(includes/excludes filter). Additionally, you can already take advantage of
dropping unnecessary labels with `scope.variables.excludes` in the current
release. Link to 1.16 metric features:
https://nightlies.apache.org/flink/flink-docs-release-1.16/docs/deployment/metric_reporters/#reporter
Best,
Mason
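
The idea of an includes/excludes filter can be sketched as follows. This is a simplified stand-in that matches metric names against shell-style patterns. It does not reproduce the filter syntax of Flink 1.16, which is described at the link above, and the function names are made up for this example.

```python
from fnmatch import fnmatchcase
from typing import Iterable, List, Sequence


def should_report(name: str, includes: Sequence[str], excludes: Sequence[str]) -> bool:
    """Report a metric if it matches an include pattern and no exclude pattern."""
    included = any(fnmatchcase(name, pattern) for pattern in includes)
    excluded = any(fnmatchcase(name, pattern) for pattern in excludes)
    return included and not excluded


def filter_metrics(names: Iterable[str],
                   includes: Sequence[str] = ("*",),
                   excludes: Sequence[str] = ()) -> List[str]:
    """Keep only the metric names that pass the include/exclude check."""
    return [name for name in names if should_report(name, includes, excludes)]


if __name__ == "__main__":
    metrics = [
        "numRecordsIn",
        "numRecordsOut",
        "lastCheckpointDuration",
        "Status.JVM.Memory.Heap.Used",
    ]
    print(filter_metrics(metrics, includes=["numRecords*", "lastCheckpointDuration"]))
```

Filtering by metric name reduces the number of measurements, while dropping variables reduces the number of tag combinations per measurement, so the two approaches complement each other.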
On Fri, Jul 1, 2022 at 3:55 AM Martijn Visser <ma...@apache.org>
wrote:
> Have you considered setting the value for some of the series to a fixed
> value? For example, if you're not interested in the value for <task_id>,
> you could consider setting that to a fixed value "task_id" [1]?
>
> Best regards,
>
> Martijn
>
> [1]
> https://nightlies.apache.org/flink/flink-docs-master/docs/ops/metrics/#system-scope
Re: influxdb metrics reporter - 4k series per job restart
Posted by Martijn Visser <ma...@apache.org>.
Have you considered setting the value for some of the series to a fixed
value? For example, if you're not interested in the value for <task_id>,
you could consider setting that to a fixed value "task_id" [1]?
Best regards,
Martijn
[1]
https://nightlies.apache.org/flink/flink-docs-master/docs/ops/metrics/#system-scope
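
The suggestion above can be pictured as replacing a placeholder in a scope format with a literal. The sketch below is a toy expansion of a format string, not Flink's implementation, and the format strings and values in it are made up for the illustration. The thread does not establish whether the scope format affects the tags written by the InfluxDB reporter. The reply that resolved the issue used `scope.variables.excludes` instead.

```python
import re
from typing import Mapping


def expand_scope(scope_format: str, variables: Mapping[str, str]) -> str:
    """Replace each <name> placeholder with its value; literals stay as they are."""
    def substitute(match: "re.Match[str]") -> str:
        name = match.group(1)
        # Unknown placeholders are left untouched rather than guessed.
        return variables.get(name, match.group(0))

    return re.sub(r"<(\w+)>", substitute, scope_format)


if __name__ == "__main__":
    values = {"host": "host-1", "job_name": "example-job", "task_id": "a1b2c3"}
    # With the placeholder, the identifier changes whenever task_id changes.
    print(expand_scope("<host>.<job_name>.<task_id>", values))
    # With the literal "task_id", the identifier is the same for every task_id.
    print(expand_scope("<host>.<job_name>.task_id", values))
```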
On Thu, 30 Jun 2022 at 15:52, Weihua Hu <hu...@gmail.com> wrote:
> Hi, Filip
>
> You can modify the InfluxdbReporter code to override the
> notifyOfAddedMetric method and report only the metrics you need.
>
> Best,
> Weihua
Re: influxdb metrics reporter - 4k series per job restart
Posted by Weihua Hu <hu...@gmail.com>.
Hi, Filip
You can modify the InfluxdbReporter code to override the notifyOfAddedMetric
method and report only the metrics you need.
Best,
Weihua
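
The shape of that change can be sketched as a subclass that ignores metrics outside an allowlist. This is a Python stand-in, not Flink's Java reporter API. Both class names and the simplified method signature are hypothetical and only mirror the method named in the message above.

```python
from typing import Dict, Iterable


class RecordingReporter:
    """Hypothetical stand-in for a reporter that registers every metric it is told about."""

    def __init__(self) -> None:
        self.registered: Dict[str, object] = {}

    def notify_of_added_metric(self, metric: object, name: str) -> None:
        self.registered[name] = metric


class AllowlistReporter(RecordingReporter):
    """Overrides the registration hook so that only allowed metrics are kept."""

    def __init__(self, allowed: Iterable[str]) -> None:
        super().__init__()
        self.allowed = set(allowed)

    def notify_of_added_metric(self, metric: object, name: str) -> None:
        # Metrics that are never registered are never reported.
        if name in self.allowed:
            super().notify_of_added_metric(metric, name)


if __name__ == "__main__":
    reporter = AllowlistReporter({"numRecordsIn", "lastCheckpointDuration"})
    for metric_name in ["numRecordsIn", "numRecordsOut", "lastCheckpointDuration"]:
        reporter.notify_of_added_metric(object(), metric_name)
    print(sorted(reporter.registered))
```

The trade-off of this approach is that it means maintaining a modified reporter, whereas the configuration-based options discussed in the thread need no code changes.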