Posted to commits@samza.apache.org by "Marouane RAJI (JIRA)" <ji...@apache.org> on 2019/07/01 08:55:00 UTC

[jira] [Created] (SAMZA-2265) Memory leak potentially due to Kafka Checkpoint Management

Marouane RAJI created SAMZA-2265:
------------------------------------

             Summary: Memory leak potentially due to Kafka Checkpoint Management
                 Key: SAMZA-2265
                 URL: https://issues.apache.org/jira/browse/SAMZA-2265
             Project: Samza
          Issue Type: Bug
    Affects Versions: 1.0, 1.1
         Environment:  

```
job.container.count=110
yarn.container.memory.mb=4000
yarn.container.cpu.cores=8
yarn.am.container.cpu.cores=8
yarn.am.container.memory.mb=1024
task.opts=-Xmx2800M
task.checkpoint.replication.factor=2
```
            Reporter: Marouane RAJI
         Attachments: image-2019-07-01-09-47-11-241.png, image-2019-07-01-09-48-45-876.png, image-2019-07-01-09-50-04-693.png

Hi, 

We recently upgraded one of our high-throughput Samza jobs from 0.13.1 to 1.0 and then to 1.1. Both newer versions appear to have a memory leak: memory usage keeps growing until containers fail and YARN restarts them.
It is worth noting that we upgraded other, smaller Samza jobs (smaller in both container specs and throughput) without any issues.





Job specs:
 * reads ~70k msg/sec
 * 211 input topics, including one broadcast topic (2 msg/day, used for config updates)
 * 1 output topic

Below is the memory consumption of one container in both versions:

!image-2019-07-01-09-47-11-241.png!

 

Heap dump comparison:

!image-2019-07-01-09-48-45-876.png!

 

The difference between the two versions keeps increasing slowly; the main cause is the increase in byte[].



In versions 1.0 and 1.1, the main reference holding these bytes seems to be KafkaCheckpointManager:
!image-2019-07-01-09-50-04-693.png!

 

Could this PR [https://github.com/apache/samza/pull/993] solve the issue, since it would release the KafkaConsumer used for checkpointing?
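
To make the question concrete, here is a minimal Python sketch of the general pattern being asked about: a reader that closes its consumer once the last checkpoint has been read, so that any buffered bytes are no longer referenced. This is not Samza's code and makes no claim about what the PR actually changes; `FakeConsumer` and `CheckpointReader` are hypothetical names invented for illustration, and the `buffer` field only models the retained byte[].

```python
class FakeConsumer:
    """Hypothetical stand-in for a Kafka consumer that buffers fetched records."""

    def __init__(self, records):
        self._records = list(records)
        self.buffer = bytearray()  # models the byte[] retained while the consumer is open
        self.closed = False

    def poll(self):
        """Return the next record, or None when the topic is exhausted."""
        if self.closed:
            raise RuntimeError("consumer is closed")
        if not self._records:
            return None
        record = self._records.pop(0)
        self.buffer.extend(record)  # buffered data stays referenced until close()
        return record

    def close(self):
        """Drop the buffered bytes and mark the consumer as closed."""
        self.buffer = bytearray()
        self.closed = True


class CheckpointReader:
    """Hypothetical checkpoint manager that reads the last checkpoint once."""

    def __init__(self, consumer):
        self._consumer = consumer

    def read_last_checkpoint(self):
        last = None
        try:
            record = self._consumer.poll()
            while record is not None:
                last = record
                record = self._consumer.poll()
        finally:
            # Release the consumer after the read so the reader no longer
            # holds a reference to it or to its buffers.
            self._consumer.close()
            self._consumer = None
        return last
```

If the consumer were instead kept open and referenced for the lifetime of the container, its buffers would stay reachable, which is the kind of retention the heap dump above seems to point to.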

Thanks. 

 

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)