Posted to dev@flink.apache.org by "Mathieu DESPRIEE (Jira)" <ji...@apache.org> on 2023/06/01 16:30:00 UTC

[jira] [Created] (FLINK-32242) Datadog HTTP Reporter produces a huge outgoing traffic and CPU overhead

Mathieu DESPRIEE created FLINK-32242:
----------------------------------------

             Summary: Datadog HTTP Reporter produces a huge outgoing traffic and CPU overhead
                 Key: FLINK-32242
                 URL: https://issues.apache.org/jira/browse/FLINK-32242
             Project: Flink
          Issue Type: Bug
          Components: Runtime / Metrics
    Affects Versions: 1.15.2
         Environment: Flink 1.15.2, AWS EMR.
            Reporter: Mathieu DESPRIEE
         Attachments: image-2023-06-01-17-42-56-305.png, image-2023-06-01-17-54-45-900.png, image-2023-06-01-17-56-50-809.png

We're running a relatively small Flink cluster (7 task-managers, 8 cores) and are using Datadog for telemetry.

The numbers for outgoing traffic didn't add up: Kafka producers, task activity, and host system metrics together could not account for it. After investigation, we discovered that the excess traffic was generated by the DatadogHttpReporter.

We switched the reporter to an implementation using the Java DogStatsD client (reporting to a Datadog agent on each host).

Here are outgoing traffic figures measured at a NAT gateway between the cluster and the outside world, before and after this change (all other things being equal):

!image-2023-06-01-17-56-50-809.png!

We're talking about 850 MB in 5 min, so roughly 10 GB/h of overhead here. That kind of traffic is not free on AWS...
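
For reference, the extrapolation from the 5-minute window to an hourly figure can be checked with a few lines of Python. This is only a sketch of the arithmetic: the function name is made up for illustration, and it assumes decimal units (1 GB = 1000 MB) and a constant traffic rate over the hour.

```python
def hourly_overhead_gb(mb_per_window: float, window_minutes: float) -> float:
    """Extrapolate traffic measured over a short window to GB per hour.

    Assumes a constant rate and decimal units (1 GB = 1000 MB).
    """
    windows_per_hour = 60 / window_minutes
    return mb_per_window * windows_per_hour / 1000


# 850 MB observed in a 5-minute window at the NAT gateway
print(hourly_overhead_gb(850, 5))  # 10.2
```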

Here is the change in `flink.taskmanager.Status.JVM.CPU.Load` (over the whole cluster):

!image-2023-06-01-17-54-45-900.png!

Reporting telemetry as JSON over HTTP has a *HUGE* overhead.
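
To give a rough feel for the encoding side of that overhead, the sketch below compares the size of one gauge value encoded as a JSON body (shaped roughly like Datadog's v1 series API) with the same value as a DogStatsD datagram. The helper names, host name, tag values and timestamp are made up for illustration. It does not reproduce the measurements above and it ignores HTTP headers, TLS, batching and anything the agent does before forwarding.

```python
import json


def json_series_payload(metric, value, timestamp, host, tags):
    """One gauge point as a compact JSON body, roughly in the v1 series shape."""
    body = {
        "series": [
            {
                "metric": metric,
                "type": "gauge",
                "host": host,
                "tags": tags,
                "points": [[timestamp, value]],
            }
        ]
    }
    return json.dumps(body, separators=(",", ":")).encode("utf-8")


def dogstatsd_datagram(metric, value, tags):
    """The same gauge as a DogStatsD datagram: name:value|g|#tag1,tag2."""
    return f"{metric}:{value}|g|#{','.join(tags)}".encode("utf-8")


metric = "flink.taskmanager.Status.JVM.CPU.Load"
tags = ["host:tm-1", "job:example"]  # hypothetical tag values

json_bytes = json_series_payload(metric, 0.42, 1685637000, "tm-1", tags)
statsd_bytes = dogstatsd_datagram(metric, 0.42, tags)

# Print both encoded sizes in bytes for comparison
print(len(json_bytes), len(statsd_bytes))
```

The per-metric size difference is only one factor. The JSON variant also pays for serialization and an HTTPS round trip from every reporting process, whereas the datagram goes to an agent on the same host.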

So I would strongly advocate deprecating this reporter and recommending that users switch to a DogStatsD-based implementation. One exists ([https://github.com/aroch/flink-metrics-dogstatsd], not tested). On our side, we developed our own, which we can share if requested.


--
This message was sent by Atlassian Jira
(v8.20.10#820010)