Posted to jira@kafka.apache.org by "John Roesler (JIRA)" <ji...@apache.org> on 2018/11/20 15:57:00 UTC

[jira] [Comment Edited] (KAFKA-7660) Stream Metrics - Memory Analysis

    [ https://issues.apache.org/jira/browse/KAFKA-7660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16693398#comment-16693398 ] 

John Roesler edited comment on KAFKA-7660 at 11/20/18 3:56 PM:
---------------------------------------------------------------

Hi [~pkleindl],

Thanks for looking at this!

I don't have any hard answers right now, but I'll share some context, which might help you make an argument for whether this can be improved or not...

The basic structure of metering in Kafka is that you:
 # create a sensor, which is essentially just a container for metrics
 # add metrics to the sensor. Each metric has a name, consisting of a "name", a "group", and a set of tags (and a description).
 # keep a reference to the sensor, and call "sensor.record()" to make a measurement. When you do this, the sensor updates each of its metrics.
 # sensors can have parents, which allows you to maintain aggregated metrics. When you record a measurement on a child sensor, it propagates to the parent.
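
To make the four steps above concrete, here is a minimal plain-Java sketch. It is a toy model, not Kafka's actual classes: the names SensorSketch, Sensor, CountMetric and demo are made up for illustration, the metric only counts recordings, and each sensor has at most one parent for brevity.

```java
import java.util.ArrayList;
import java.util.List;

public class SensorSketch {

    /** Toy metric: counts how many measurements it has seen. */
    static final class CountMetric {
        final String name;
        long count;

        CountMetric(String name) {
            this.name = name;
        }

        void update(double value) {
            count++;
        }
    }

    /** Toy sensor: a container for metrics with an optional parent. */
    static final class Sensor {
        final String name;
        final Sensor parent; // null when the sensor has no parent
        final List<CountMetric> metrics = new ArrayList<>();

        Sensor(String name, Sensor parent) {
            this.name = name;
            this.parent = parent;
        }

        /** Step 2: add a metric to the sensor. */
        CountMetric add(String metricName) {
            CountMetric metric = new CountMetric(metricName);
            metrics.add(metric);
            return metric;
        }

        /** Step 3: update every metric; step 4: propagate to the parent. */
        void record(double value) {
            for (CountMetric metric : metrics) {
                metric.update(value);
            }
            if (parent != null) {
                parent.record(value);
            }
        }
    }

    /** Returns the counts seen by child A, child B, and the shared parent. */
    static long[] demo() {
        Sensor parent = new Sensor("all-nodes", null);
        CountMetric total = parent.add("process-total");
        Sensor childA = new Sensor("node-a", parent);
        CountMetric a = childA.add("process-total");
        Sensor childB = new Sensor("node-b", parent);
        CountMetric b = childB.add("process-total");

        childA.record(1.0);
        childA.record(1.0);
        childB.record(1.0);
        return new long[] {a.count, b.count, total.count};
    }

    public static void main(String[] args) {
        long[] counts = demo();
        System.out.println(counts[0] + " " + counts[1] + " " + counts[2]); // prints "2 1 3"
    }
}
```

The parent ends up with the aggregate of both children, which is the point of the parent/child arrangement.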

 

1)

The strings you listed are part of our metric names, either tags or the "type" (aka "group" in the code). See [https://docs.confluent.io/current/streams/monitoring.html#processor-node-metrics] for example.

You noticed it in Sensors.java, but that's just one place that some metrics are defined. Most of the Processor Node Metrics are defined in [https://github.com/apache/kafka/blob/trunk/streams/src/main/java/org/apache/kafka/streams/processor/internals/ProcessorNode.java#L165]

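Most of the time, the group name and tags are produced by string concatenation, which might be why they didn't turn up in your search. That might also be why the strings aren't interned: Java only interns compile-time constant strings automatically, so each string built by concatenation at run time is a separate object.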
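
As a small illustration of the interning point, the following self-contained Java sketch builds the same group name twice by concatenation and compares the results. The class InternSketch and the helper groupName are hypothetical, and the way the string is split into a prefix and suffix is an assumption for the demo, not how Streams actually assembles its names.

```java
public class InternSketch {

    /** Hypothetical helper: concatenates at run time, so each call creates a new String. */
    static String groupName(String prefix, String suffix) {
        return prefix + suffix;
    }

    public static void main(String[] args) {
        String literal = "stream-processor-node-metrics"; // literals are interned
        String a = groupName("stream-", "processor-node-metrics");
        String b = groupName("stream-", "processor-node-metrics");

        System.out.println(a.equals(b));              // true: same characters
        System.out.println(a == b);                   // false: two separate heap objects
        System.out.println(a == literal);             // false: not the interned instance
        System.out.println(a.intern() == literal);    // true: intern() returns the pooled copy
        System.out.println(a.intern() == b.intern()); // true: duplicates collapse to one object
    }
}
```

A heap analyzer would count a and b as duplicate strings, even though the source contains the text only once.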

 

2)

When you start Streams, we create a "whole bunch" of components with accompanying metrics. In many cases, we keep a pair of a more granular child sensor and a coarser parent sensor, with the child set to "debug" (silent by default).

Also, when you stop (aka close) Streams, we unload all the metrics. Likewise, when a task migrates to another instance, it unloads all its metrics on the first machine.

IIRC, we only keep the reference to the child sensor, so we need "parentSensors" to be able to remove the parent sensor once the last child is gone, or something like that. So the map should only be keeping alive objects that we actually need.

That said, I think I fixed a memory leak bug where we never actually removed sensors from that map. I don't recall which version that was in, so it might well be what you are seeing. Maybe take a look at trunk or the 2.1 branch for comparison. (2.1 is in the process of being released.)
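
For illustration only, the child-to-parent bookkeeping described above might look roughly like the plain-Java sketch below. Everything here is an assumption made for the example: the class ParentSensorsSketch, its method names, the use of plain strings in place of Sensor objects, and a Set standing in for the metrics registry. It is not the actual Streams code.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class ParentSensorsSketch {

    // Stand-in for the metrics registry: names of all currently registered sensors.
    private final Set<String> registered = new HashSet<>();
    // Child sensor name -> parent sensor name.
    private final Map<String, String> parentSensors = new HashMap<>();

    void addChild(String child, String parent) {
        registered.add(parent);
        registered.add(child);
        parentSensors.put(child, parent);
    }

    void removeChild(String child) {
        registered.remove(child);
        // Dropping the map entry matters: if it is never removed,
        // the map keeps growing and pins its entries in memory.
        String parent = parentSensors.remove(child);
        // Unregister the parent once no remaining child refers to it.
        if (parent != null && !parentSensors.containsValue(parent)) {
            registered.remove(parent);
        }
    }

    int registeredCount() {
        return registered.size();
    }

    int trackedCount() {
        return parentSensors.size();
    }

    /** Returns {registered, tracked} after removing one child, then after removing both. */
    static int[] demo() {
        ParentSensorsSketch sketch = new ParentSensorsSketch();
        sketch.addChild("node-a", "all-nodes");
        sketch.addChild("node-b", "all-nodes");

        sketch.removeChild("node-a");
        int registeredAfterFirst = sketch.registeredCount(); // parent and node-b remain
        int trackedAfterFirst = sketch.trackedCount();

        sketch.removeChild("node-b");
        return new int[] {
            registeredAfterFirst, trackedAfterFirst,
            sketch.registeredCount(), sketch.trackedCount()
        };
    }

    public static void main(String[] args) {
        int[] r = demo();
        System.out.println(r[0] + " " + r[1] + " " + r[2] + " " + r[3]); // prints "2 1 0 0"
    }
}
```

With correct cleanup, both collections are empty once the last child is gone, so a map that stays large after tasks have closed or migrated would point toward the kind of leak mentioned above.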

 

Like I said at the start, this is just for context. I think it would take further analysis to decide whether these objects constitute a problem for GC or not. One thought I have is that for both the metric names and the parent/child sensors, these objects are very long-lived. They're essentially permanent, so they may not be causing GC pressure. 15MB worth of "stream-processor-node-metrics" seems a bit wasteful with heap, though... 

 

If you're interested in attempting to improve the memory footprint, we could discuss some potential directions and some experiments to try. On the other hand, if you just wanted to let us know what you observed, that's fine, too!


> Stream Metrics - Memory Analysis
> --------------------------------
>
>                 Key: KAFKA-7660
>                 URL: https://issues.apache.org/jira/browse/KAFKA-7660
>             Project: Kafka
>          Issue Type: Bug
>          Components: metrics, streams
>    Affects Versions: 2.0.0
>            Reporter: Patrik Kleindl
>            Priority: Minor
>         Attachments: Mem_Collections.jpeg, Mem_DuplicateStrings.jpeg, Mem_DuplicateStrings2.jpeg, Mem_Hotspots.jpeg, Mem_KeepAliveSet.jpeg, Mem_References.jpeg
>
>
> During the analysis of JVM memory two possible issues were shown which I would like to bring to your attention:
> 1) Duplicate strings
> Top findings: 
> string_content="stream-processor-node-metrics" count="534,277"
> string_content="processor-node-id" count="148,437"
> string_content="stream-rocksdb-state-metrics" count="41,832"
> string_content="punctuate-latency-avg" count="29,681" 
>  
> "stream-processor-node-metrics" seems to be used in Sensors.java as a literal and not interned.
>  
> 2) The HashMap parentSensors from org.apache.kafka.streams.processor.internals.StreamThread$StreamsMetricsThreadImpl was reported multiple times as suspicious for potentially keeping alive a lot of objects. In our case the reported size was 40-50MB each.
> I haven't looked too deeply into the code, but I noticed that the class Sensor.java, which is used as a key in the HashMap, does not implement equals or hashCode. Not sure whether this is a problem, though.
>  
> The analysis was done with Dynatrace 7.0
> We are running Confluent 5.0/Kafka2.0-cp1 (Brokers as well as Clients)
>  
> Screenshots are attached.


