Posted to user@flink.apache.org by Vishal Santoshi <vi...@gmail.com> on 2021/03/20 12:44:17 UTC

DataDog and Flink

Hello folks,
                  This is quite strange. We see a TM stop reporting metrics
to DataDog. In the logs from that specific TM, every DataDog dispatch times
out with *java.net.SocketTimeoutException: timeout*, and that repeats on
every dispatch, which appears to be on a 10-second cadence per container.
The TM itself keeps humming along, so it does not seem to be under
memory/CPU distress. And the exception is *not* transient: reporting just
stops dead and from there on every dispatch times out.

Looking at the SLA provided by DataDog, a throttling response should pretty
much never surface as a SocketTimeout, unless of course the reporting of
that specific condition is off. So this looks very much like a n/w issue,
which is odd because the other TMs on the same n/w just hum along, sending
their metrics successfully. The other possibility is that the sheer volume
of metrics this TM pushes is prohibitive. That said, the exception itself
is not helpful.

Any ideas from folks who have used the DataDog reporter with Flink? Even
best practices would be a sufficient beginning.

Regards.

Re: DataDog and Flink

Posted by Vishal Santoshi <vi...@gmail.com>.
Yep, there isn't a single endpoint that does the whole dump, but something
like this works (dirty, but who cares :)). The vertex metrics are the most
numerous anyway:

```
curl -s http://xxxx/jobs/[job_id] | jq -r '.vertices[].id' \
  | xargs -I {} curl -s http://xxxxxx/jobs/[job_id]/vertices/{}/metrics | jq .
```
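
If you want it across every job in one go, a slightly fuller sketch along
the same lines (still only the standard /jobs, /jobs/:jobid and
/jobs/:jobid/vertices/:vertexid/metrics endpoints; the host is a placeholder):

```
# dump the per-vertex metric names for every job known to the JobManager (sketch)
JM=http://xxxx
for job in $(curl -s "$JM/jobs" | jq -r '.jobs[].id'); do
  for vertex in $(curl -s "$JM/jobs/$job" | jq -r '.vertices[].id'); do
    echo "== job $job vertex $vertex =="
    curl -s "$JM/jobs/$job/vertices/$vertex/metrics" | jq .
  done
done
```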

Re: DataDog and Flink

Posted by Vishal Santoshi <vi...@gmail.com>.
Yes, I will do that.

Regarding the metrics dump through REST, it works for a specific TM, but it
refuses to do it across all jobs and vertices/operators etc. Moreover, I am
not sure I have ready access to the vertex ids (vertex_id) from the UI.

curl http://[jm]/taskmanagers/[tm_id]
curl http://[jm]/taskmanagers/[tm_id]/metrics
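
(Those endpoints return only the metric names by default; values can be
fetched with the get query parameter. The metric names below are just
examples:)

```
# list the metric names registered on one TM, then fetch a couple of values
curl -s "http://[jm]/taskmanagers/[tm_id]/metrics" | jq -r '.[].id'
curl -s "http://[jm]/taskmanagers/[tm_id]/metrics?get=Status.JVM.Memory.Heap.Used,Status.JVM.CPU.Load" | jq .
```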



Re: DataDog and Flink

Posted by Arvid Heise <ar...@apache.org>.
Hi Vishal,

The REST API is the most direct way to get at all the metrics, as Matthias
pointed out. Additionally, you could also add a JMX reporter and log into
the machines to check.
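
(Roughly like this, as a sketch; the factory class and port option are the
ones from the metric reporter docs, and the port range is just an example:)

```
# enable the JMX reporter alongside the Datadog one, so metrics can be
# inspected locally on each machine (e.g. via jconsole)
cat >> conf/flink-conf.yaml <<'EOF'
metrics.reporter.jmx.factory.class: org.apache.flink.metrics.jmx.JMXReporterFactory
metrics.reporter.jmx.port: 9250-9260
EOF
```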

But in general, I think you are on the right track. You need to reduce the
metrics that are sent to DD by configuring the scope / excluding variables.

Furthermore, I think it would be a good idea to make the timeout
configurable. Could you open a ticket for that?

Best,

Arvid

Re: DataDog and Flink

Posted by Matthias Pohl <ma...@ververica.com>.
Hi Vishal,
what about the TM metrics' REST endpoint [1]? Is this something you could
use to get all the metrics for a specific TaskManager? Or are you looking
for something else?
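
For example, something rough like this would walk it for every TaskManager
(the JobManager address is a placeholder):

```
# list all TaskManagers, then dump the metric names each one exposes (sketch)
JM=http://<jobmanager:port>
for tm in $(curl -s "$JM/taskmanagers" | jq -r '.taskmanagers[].id'); do
  echo "== $tm =="
  curl -s "$JM/taskmanagers/$tm/metrics" | jq -r '.[].id'
done
```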

Best,
Matthias

[1]
https://ci.apache.org/projects/flink/flink-docs-release-1.12/ops/rest_api.html#taskmanagers-metrics

Re: DataDog and Flink

Posted by Vishal Santoshi <vi...@gmail.com>.
That said, is there a way to get a dump of all metrics exposed by a TM? I
was searching for it, and I bet we could get it via a ServiceMonitor on k8s
(scrape), but I am missing a way to hit a TM and dump all the metrics that
are pushed.

Thanks and regards.

Re: DataDog and Flink

Posted by Vishal Santoshi <vi...@gmail.com>.
I guess there is a bigger issue here. We dropped the property to 500. We
also realized that this failure happened on a TM that had one specific job
running on it. What was good (but surprising) is that the exception became
the more protocol-specific 413 (as in, the chunk is greater than some size
limit DD has on a request):

Failed to send request to Datadog (response was Response{protocol=h2,
code=413, message=, url=
https://app.datadoghq.com/api/v1/series?api_key=**********})

which implies that the socket timeout was masking this issue. With 2000, the
payload was just so huge that DD was unable to parse it in time (or it was
slow to upload, etc.). Now we could go lower, but that makes less sense. We
could instead play with
https://ci.apache.org/projects/flink/flink-docs-stable/ops/metrics.html#system-scope
to reduce the size of the tags (or keys).
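
Something like the following is what I have in mind, assuming the dghttp
reporter actually derives its metric names/tags from these scope formats
(worth verifying against the DatadogHttpReporter code); the exact formats
are only an illustration:

```
# sketch: shorten the task/operator scopes so each reported metric carries
# fewer name components (the dropped variables like <host>/<tm_id> would need
# to remain distinguishable via tags, so double-check before rolling out)
cat >> conf/flink-conf.yaml <<'EOF'
metrics.scope.task:     <job_name>.<task_name>.<subtask_index>
metrics.scope.operator: <job_name>.<operator_name>.<subtask_index>
EOF
```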

Re: DataDog and Flink

Posted by Vishal Santoshi <vi...@gmail.com>.
If we look at this code
<https://github.com/apache/flink/blob/97bfd049951f8d52a2e0aed14265074c4255ead0/flink-metrics/flink-metrics-datadog/src/main/java/org/apache/flink/metrics/datadog/DatadogHttpReporter.java#L159>,
the metrics are divided into chunks up to a max size and enqueued
<https://github.com/apache/flink/blob/97bfd049951f8d52a2e0aed14265074c4255ead0/flink-metrics/flink-metrics-datadog/src/main/java/org/apache/flink/metrics/datadog/DatadogHttpClient.java#L110>.
The Request
<https://github.com/apache/flink/blob/97bfd049951f8d52a2e0aed14265074c4255ead0/flink-metrics/flink-metrics-datadog/src/main/java/org/apache/flink/metrics/datadog/DatadogHttpClient.java#L75>
has a 3-second read/connect/write timeout, which IMHO should be
configurable (or is it?). Since the number of metrics exposed by a Flink
cluster is pretty high (and the metric names along with their tags are
long), it may make sense to limit the number of metrics in a single chunk
(to ultimately limit the size of a single chunk). There is a configuration
which allows for reducing the number of metrics in a single chunk:

metrics.reporter.dghttp.maxMetricsPerRequest: 2000

We could decrease this to 1500 (1500 is pretty arbitrary, not based on any
empirical reasoning) and see if that stabilizes the dispatch. It is
inevitable that the number of requests will grow and we may hit the
throttle, but then we would at least see that exception rather than the
timeouts, which are generally less intuitive.
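
Concretely, something like this (the class and apikey lines are just the
standard dghttp reporter setup, shown for context; the only change is the
chunk size):

```
# sketch: the same Datadog reporter config, with the per-request chunk size lowered
cat >> conf/flink-conf.yaml <<'EOF'
metrics.reporter.dghttp.class: org.apache.flink.metrics.datadog.DatadogHttpReporter
metrics.reporter.dghttp.apikey: <DD_API_KEY>
metrics.reporter.dghttp.maxMetricsPerRequest: 1500
EOF
```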

Any thoughts?



Re: DataDog and Flink

Posted by Arvid Heise <ar...@apache.org>.
Hi Vishal,

I have no experience with the Flink+DataDog setup, but I have worked a bit
with DataDog before.
I'd agree that the timeout does not look like a rate limit. It would also
be odd that the other TMs with a similar rate still pass, so I'd suspect
n/w issues.
Can you log into the TM's machine and try out manually how the system
behaves?
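
For example, from the affected TM's host, something along these lines (the
endpoint and payload follow Datadog's v1 series API; swap in your real
api_key, and adjust if you are on the EU site):

```
# post one throwaway datapoint to the Datadog intake with a short client-side
# timeout, to see whether the hang reproduces outside of Flink
TS=$(date +%s)
curl -sv --max-time 3 -X POST \
  "https://app.datadoghq.com/api/v1/series?api_key=<DD_API_KEY>" \
  -H "Content-Type: application/json" \
  -d "{\"series\":[{\"metric\":\"flink.connectivity.test\",\"points\":[[${TS},1]],\"type\":\"gauge\"}]}"
```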
