Posted to jira@kafka.apache.org by "John Roesler (JIRA)" <ji...@apache.org> on 2019/03/27 21:21:00 UTC

[jira] [Commented] (KAFKA-8165) Streams task causes Out Of Memory after connection issues and store restoration

    [ https://issues.apache.org/jira/browse/KAFKA-8165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16803370#comment-16803370 ] 

John Roesler commented on KAFKA-8165:
-------------------------------------

Thanks for the report, [~xmar].

It sounds like the app ran low on memory, and started experiencing long GC pauses, which disrupted its connection to the broker, and possibly also caused it to time out and undergo a rebalance. Then, of course, it actually did fully run out of memory and crash.

As for the cause, it's extremely hard to say from this information alone. I do recall fixing a few memory-leak / memory-pressure problems in Streams since the 2.0 time frame, but IIRC they were relatively hard to encounter "in the wild".

For next steps, I'd recommend enabling GC logs (it's kind of nice to enable these for every Java application you run), which let you notice when memory pressure starts building, not just when it reaches critical levels. You can also tell Java to take a heap dump on crash, which you could then analyze to reason about what is using so much memory. If your heap is extremely large, though, it may not be practical to take the dump, or to transfer and analyze it.
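For reference, here is a minimal sketch of the relevant JVM flags; the log/dump paths and the jar name are just placeholders, and the GC-logging syntax depends on whether you run JDK 8 or JDK 9+:

{code:bash}
# JDK 8: verbose GC logging to a file, plus a heap dump on OutOfMemoryError
java -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/var/log/app/gc.log \
     -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/var/log/app/ \
     -jar streams-app.jar

# JDK 9+: the unified-logging equivalent of the GC flags above
java -Xlog:gc*:file=/var/log/app/gc.log:time,uptime \
     -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/var/log/app/ \
     -jar streams-app.jar
{code}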

It depends heavily on what exactly your application is doing, but in the abstract, I don't think 160 records/sec is a terribly concerning load for Streams. It sounds more likely to me that, as you say, there's a memory leak somewhere, or maybe the application just needs a bigger heap than you've currently provided.
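If you want to see what heap the JVM is actually getting before raising it, something along these lines works (the sizes below are only examples, not a recommendation):

{code:bash}
# Print the default/derived max heap size for this JVM on this machine
java -XX:+PrintFlagsFinal -version | grep -i maxheapsize

# Run the application with an explicit, larger heap (example sizes)
java -Xms4g -Xmx4g -jar streams-app.jar
{code}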

Does this help?

> Streams task causes Out Of Memory after connection issues and store restoration
> -------------------------------------------------------------------------------
>
>                 Key: KAFKA-8165
>                 URL: https://issues.apache.org/jira/browse/KAFKA-8165
>             Project: Kafka
>          Issue Type: Bug
>          Components: streams
>    Affects Versions: 2.1.0
>         Environment: Kafka 2.1, Kafka Streams 2.1
> Amazon Linux, on Docker based on wurstmeister/kafka image
>            Reporter: Di Campo
>            Priority: Major
>
> We have a (largely stateful) Kafka Streams 2.1 application which, with the Kafka brokers stable, had been consuming ~160 messages per second at a sustained rate for several hours. 
> However, it then started having connection issues to the brokers. 
> {code:java}
> Connection to node 3 (/172.31.36.118:9092) could not be established. Broker may not be available. (org.apache.kafka.clients.NetworkClient){code}
> Also it began showing a lot of these errors: 
> {code:java}
> WARN [Consumer clientId=stream-processor-81e1ce17-1765-49f8-9b44-117f983a2d19-StreamThread-2-consumer, groupId=stream-processor] 1 partitions have leader brokers without a matching listener, including [broker-2-health-check-0] (org.apache.kafka.clients.NetworkClient){code}
> In fact, the _health-check_ topic exists on the broker but is not consumed by this topology or used in any way by the Streams application (it is just a broker health check). The application does not complain about the topics that are actually consumed by the topology. 
> Some time after these errors (which appear at a rate of ~24 per second for about 5 minutes), the following logs appear: 
> {code:java}
> [2019-03-27 15:14:47,709] WARN [Consumer clientId=stream-processor-81e1ce17-1765-49f8-9b44-117f983a2d19-StreamThread-1-restore-consumer, groupId=] Connection to node -3 (/ip3:9092) could not be established. Broker may not be available. (org.apache.kafka.clients.NetworkClient){code}
> In between 6 and then 3 lines of "Connection could not be established" error messages, 3 lines like the following slipped in: 
> [2019-03-27 15:14:47,723] WARN Started Restoration of visitorCustomerStore partition 15 total records to be restored 17 (com.divvit.dp.streams.applications.monitors.ConsoleGlobalRestoreListener)
>  
> ... one for each of my KV stores (although one other KV store and a WindowedStore do not appear). 
> Then I finally see "Restoration Complete" messages (via a logging ConsoleGlobalRestoreListener, as in the docs) for all of my stores. So it seems it should now be fine to resume processing.
> Three minutes later, some events get processed, and I see an OOM error:  
> java.lang.OutOfMemoryError: GC overhead limit exceeded
>  
> ... so given that it usually keeps processing for hours under the same circumstances, I'm wondering whether there is a memory leak in the connection resources or somewhere in the handling of this scenario.
> Kafka and KafkaStreams 2.1


