Posted to jira@kafka.apache.org by "Robin Tweedie (JIRA)" <ji...@apache.org> on 2018/02/19 15:59:00 UTC

[jira] [Comment Edited] (KAFKA-6199) Single broker with fast growing heap usage

    [ https://issues.apache.org/jira/browse/KAFKA-6199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16369257#comment-16369257 ] 

Robin Tweedie edited comment on KAFKA-6199 at 2/19/18 3:58 PM:
---------------------------------------------------------------

There's another piece of evidence I've noticed about this single broker: it builds up file descriptors in a way that the other brokers don't. I'm not sure if this narrows the potential causes much.

On the problem broker, note the unusually high number of open {{sock}} descriptors:
{noformat}
$ sudo lsof | awk '{print $5}' | sort | uniq -c | sort -rn
  12201 REG
   7229 IPv6
   1374 sock
    337 FIFO
    264 DIR
    163 CHR
    138 0000
     77 unknown
     54 unix
     13 IPv4
      1 TYPE
      1 pack
{noformat}
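
To keep re-checking this cheaply, the same per-type breakdown can be scoped to the Kafka process alone (just a sketch; substitute the broker's actual pid for 25305):
{noformat}
# Count open descriptors by TYPE for the Kafka process only
$ sudo lsof -p 25305 | awk '{print $5}' | sort | uniq -c | sort -rn
{noformat}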

If you look at the {{lsof}} output directly, there are lots of lines like this (25305 is the Kafka pid):
{noformat}
java      25305 user *105u     sock                0,6       0t0  351061533 can't identify protocol
java      25305 user *111u     sock                0,6       0t0  351219556 can't identify protocol
java      25305 user *131u     sock                0,6       0t0  350831689 can't identify protocol
java      25305 user *134u     sock                0,6       0t0  351001514 can't identify protocol
java      25305 user *136u     sock                0,6       0t0  351410956 can't identify protocol
{noformat}
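
To put one number on it (same caveat, pid assumed to be 25305), the unidentified sockets can be counted directly:
{noformat}
# Count the lines lsof reports as "can't identify protocol" for the Kafka pid
$ sudo lsof -p 25305 | grep -c "can't identify protocol"
{noformat}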

Compare with a good broker that has an uptime of 76 days (only 65 open {{sock}} files):
{noformat}
  11729 REG
   7037 IPv6
    335 FIFO
    264 DIR
    164 CHR
    137 0000
     76 unknown
     65 sock
     54 unix
     14 IPv4
      1 TYPE
      1 pack
{noformat}

It has the same kind of {{lsof}} output, just far fewer of those lines.
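
In case anyone wants to watch the growth rate rather than point-in-time snapshots, here's a rough sketch (the pid, interval and log path are placeholders, and it assumes root or passwordless sudo):
{noformat}
# Append a timestamped file-descriptor count for the Kafka process every 5 minutes
while true; do
  echo "$(date -u +%FT%TZ) $(sudo ls /proc/25305/fd | wc -l)"
  sleep 300
done >> /tmp/kafka_fd_count.log
{noformat}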


> Single broker with fast growing heap usage
> ------------------------------------------
>
>                 Key: KAFKA-6199
>                 URL: https://issues.apache.org/jira/browse/KAFKA-6199
>             Project: Kafka
>          Issue Type: Bug
>    Affects Versions: 0.10.2.1
>         Environment: Amazon Linux
>            Reporter: Robin Tweedie
>            Priority: Major
>         Attachments: Screen Shot 2017-11-10 at 1.55.33 PM.png, Screen Shot 2017-11-10 at 11.59.06 AM.png, dominator_tree.png, histo_live.txt, histo_live_20171206.txt, histo_live_80.txt, jstack-2017-12-08.scrubbed.out, merge_shortest_paths.png, path2gc.png
>
>
> We have a single broker in our cluster of 25 with fast growing heap usage which necessitates us restarting it every 12 hours. If we don't restart the broker, it becomes very slow from long GC pauses and eventually has {{OutOfMemory}} errors.
> See {{Screen Shot 2017-11-10 at 11.59.06 AM.png}} for a graph of heap usage percentage on the broker. A "normal" broker in the same cluster stays below 50% (averaged) over the same time period.
> We have taken heap dumps when the broker's heap usage is getting dangerously high, and there are a lot of retained {{NetworkSend}} objects referencing byte buffers.
> We also noticed that the single affected broker logs a lot more of this kind of warning than any other broker:
> {noformat}
> WARN Attempting to send response via channel for which there is no open connection, connection id 13 (kafka.network.Processor)
> {noformat}
> See {{Screen Shot 2017-11-10 at 1.55.33 PM.png}} for counts of that WARN log message visualized across all the brokers (to show it happens a bit on other brokers, but not nearly as much as it does on the "bad" broker).
> I can't make the heap dumps public, but would appreciate advice on how to pin down the problem better. We're currently trying to narrow it down to a particular client, but without much success so far.
> Let me know what else I could investigate or share to track down the source of this leak.
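
For tracking the retained {{NetworkSend}} objects without sharing a full heap dump, a live class histogram is usually enough. Sketch only: 25305 is the pid from the comment above, and the {{kafka}} service account is a placeholder since {{jmap}} normally has to run as the broker process's own user:
{noformat}
# Note: -histo:live forces a full GC before counting, so expect a pause
$ sudo -u kafka jmap -histo:live 25305 | grep -i networksend
{noformat}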



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)