Posted to dev@flink.apache.org by "Sitan Pang (Jira)" <ji...@apache.org> on 2022/08/30 06:44:00 UTC

[jira] [Created] (FLINK-29134) fetch metrics may cause oom(ThreadPool task pile up)

Sitan Pang created FLINK-29134:
----------------------------------

             Summary: fetch metrics may cause oom(ThreadPool task pile up)
                 Key: FLINK-29134
                 URL: https://issues.apache.org/jira/browse/FLINK-29134
             Project: Flink
          Issue Type: Improvement
          Components: Runtime / Metrics
    Affects Versions: 1.11.0
            Reporter: Sitan Pang
         Attachments: dump-queueTask.png, dump-threadPool.png

In queryMetrics, we use a thread pool to process the data returned by the TMs.

{code:java}
private void queryMetrics(final MetricQueryServiceGateway queryServiceGateway) {
    LOG.debug("Query metrics for {}.", queryServiceGateway.getAddress());

    queryServiceGateway
            .queryMetrics(timeout)
            .whenCompleteAsync(
                    (MetricDumpSerialization.MetricSerializationResult result, Throwable t) -> {
                        if (t != null) {
                            LOG.debug("Fetching metrics failed.", t);
                        } else {
                            metrics.addAll(deserializer.deserialize(result));
                        }
                    },
                    executor);
}
{code}
The only condition for fetching metrics is that more than updateInterval has elapsed since the last update:

{code:java}
public void update() {
    synchronized (this) {
        long currentTime = System.currentTimeMillis();
        if (currentTime - lastUpdateTime > updateInterval) {
            lastUpdateTime = currentTime;
            fetchMetrics();
        }
    }
}
{code}
Therefore, if the returned data cannot be processed within one update interval, the unprocessed tasks pile up in the thread pool's queue, which may eventually cause an OOM.
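
The pile-up can be reproduced outside Flink with a small standalone sketch. It is not Flink code: the class name MetricBacklogDemo, the single-thread pool, the unbounded LinkedBlockingQueue, and the 1 KB payload are all assumptions chosen for illustration. Each simulated update hands a callback to the executor through whenCompleteAsync, as queryMetrics does, while the one worker thread is kept busy, so every later callback has to wait in the queue.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical demo class, not part of Flink.
class MetricBacklogDemo {

    /** Simulates `updates` fetch rounds against a stalled one-thread pool and returns the queue length. */
    static int backlogAfter(int updates) throws InterruptedException {
        // Assumption: one worker and an unbounded queue, so nothing is ever rejected.
        ThreadPoolExecutor executor = new ThreadPoolExecutor(
                1, 1, 0L, TimeUnit.MILLISECONDS, new LinkedBlockingQueue<>());
        // Keeps every callback blocked, standing in for slow deserialization.
        CountDownLatch release = new CountDownLatch(1);
        try {
            for (int i = 0; i < updates; i++) {
                // An already completed future stands in for a TM response that has arrived.
                CompletableFuture
                        .completedFuture(new byte[1024])
                        .whenCompleteAsync(
                                (payload, t) -> {
                                    try {
                                        release.await();
                                    } catch (InterruptedException e) {
                                        Thread.currentThread().interrupt();
                                    }
                                },
                                executor);
            }
            // The first callback occupies the worker; the rest are waiting in the queue.
            return executor.getQueue().size();
        } finally {
            release.countDown();
            executor.shutdown();
            executor.awaitTermination(5, TimeUnit.SECONDS);
        }
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("queued callbacks: " + backlogAfter(5));
    }
}
```

Nothing in this sketch bounds the queue or skips a round while earlier callbacks are still pending, so the backlog grows with every update for as long as processing is slower than the update interval.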

Besides, the REST handlers and the metrics fetcher share the same thread pool, so the problem may become even worse when the web UI is open.


--
This message was sent by Atlassian Jira
(v8.20.10#820010)