Posted to dev@kafka.apache.org by "Dave Thomas (JIRA)" <ji...@apache.org> on 2017/02/17 23:05:44 UTC

[jira] [Updated] (KAFKA-4778) OOM on kafka-streams instances with high numbers of unreaped Record classes

     [ https://issues.apache.org/jira/browse/KAFKA-4778?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dave Thomas updated KAFKA-4778:
-------------------------------
    Description: 
We have a stream processing app with ~8 source/sink stages operating at roughly 500k messages ingested per day (~4M/day across the 8 stages).
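
For context, each stage is roughly a source -> transform -> sink topology. A minimal sketch of what one stage looks like (topic names, app id, broker address and the no-op transform are placeholders, not our actual code):

  import java.util.Properties
  import org.apache.kafka.common.serialization.Serdes
  import org.apache.kafka.streams.{KafkaStreams, StreamsConfig}
  import org.apache.kafka.streams.kstream.{KStreamBuilder, ValueMapper}

  object StageSketch extends App {
    val props = new Properties()
    props.put(StreamsConfig.APPLICATION_ID_CONFIG, "stage-1")          // placeholder app id
    props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker-1:9092") // placeholder broker
    props.put(StreamsConfig.KEY_SERDE_CLASS_CONFIG, Serdes.ByteArray().getClass.getName)
    props.put(StreamsConfig.VALUE_SERDE_CLASS_CONFIG, Serdes.ByteArray().getClass.getName)

    val builder = new KStreamBuilder()
    builder
      .stream[Array[Byte], Array[Byte]]("stage-1-input")               // source topic (placeholder)
      .mapValues(new ValueMapper[Array[Byte], Array[Byte]] {
        override def apply(value: Array[Byte]): Array[Byte] = value    // real stages do their actual work here
      })
      .to("stage-1-output")                                            // sink topic (placeholder)

    new KafkaStreams(builder, props).start()
  }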

We get OOM kills once every ~18 hours. Note that it is the Linux OOM killer terminating the process, not the JVM shutting itself down.

It may be worth noting that the stream processor uses ~50 MB while processing normally for hours on end, until the problem surfaces; then, within ~20-30 seconds, memory grows suddenly from under 100 MB to over 1 GB and does not shrink until the process is killed.

We are using supervisor to restart the instances. Sometimes the restarted process dies again for the same reason as soon as stream processing resumes, and this restart/crash cycle can continue for minutes or hours. This extended window has enabled us to capture a heap dump using jmap.
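
For reference, the dump and the histogram below were produced with commands along these lines (pid and file name are placeholders):

  jmap -dump:format=b,file=streams.hprof <pid>
  jhat streams.hprof    # then browse the histogram via jhat's web UI on port 7000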

jhat's histogram feature reveals the following top objects in memory:

Class	Instance Count	Total Size (bytes)
class [B	4070487	867857833
class [Ljava.lang.Object;	2066986	268036184
class [C	539519	92010932
class [S	1003290	80263028
class [I	508208	77821516
class java.nio.HeapByteBuffer	1506943	58770777
class org.apache.kafka.common.record.Record	1506783	36162792
class org.apache.kafka.clients.consumer.ConsumerRecord	528652	35948336
class org.apache.kafka.common.record.MemoryRecords$RecordsIterator	501742	32613230
class org.apache.kafka.common.record.LogEntry	2009373	32149968
class org.xerial.snappy.SnappyInputStream	501600	20565600
class java.io.DataInputStream	501742	20069680
class java.io.EOFException	501606	20064240
class java.util.ArrayDeque	501941	8031056
class java.lang.Long	516463	4131704

Note that high on the list are org.apache.kafka.common.record.Record,
org.apache.kafka.clients.consumer.ConsumerRecord,
org.apache.kafka.common.record.MemoryRecords$RecordsIterator, and
org.apache.kafka.common.record.LogEntry.

Each of these has between ~500k and ~2M instances.

There is nothing distinctive in the stream processing logs (log levels are still at their defaults).

Could it be references (weak, phantom, etc.) that are preventing these instances from being garbage-collected?
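
(If it helps narrow that down: `jmap -histo:live <pid>` forces a full GC before counting, so any of these instances still present in a live histogram should be strongly reachable rather than merely awaiting collection.)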

Edit: to request the full heap dump (created using `jmap -dump:format=b,file=`), contact me directly at opensource@peoplemerge.com. It is 2 GB.



> OOM on kafka-streams instances with high numbers of unreaped Record classes
> ---------------------------------------------------------------------------
>
>                 Key: KAFKA-4778
>                 URL: https://issues.apache.org/jira/browse/KAFKA-4778
>             Project: Kafka
>          Issue Type: Bug
>          Components: consumer
>    Affects Versions: 0.10.1.1
>         Environment: AWS m3.large, Ubuntu 16.04.1 LTS. RocksDB on local SSD.
> Kafka cluster: 3 ZooKeeper nodes, 5 brokers.
> Stream processors are run with:
> -XX:+UseG1GC -XX:+HeapDumpOnOutOfMemoryError -XX:+PrintGCDetails
> Java(TM) SE Runtime Environment (build 1.8.0_101-b13)
> Java HotSpot(TM) 64-Bit Server VM (build 25.101-b13, mixed mode)
> Stream processors written in Scala 2.11.8
>            Reporter: Dave Thomas



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)