Posted to users@kafka.apache.org by Lukas Lalinsky <lu...@exponea.com> on 2017/09/15 14:30:03 UTC

Kafka 0.11 broker running out of file descriptors

Hello,

I'm dealing with a strange issue in production and I'm running out of
options for what to do about it.

It's a 3-node cluster running Kafka 0.11.0.1, with most topics having a
replication factor of 2. At some point, the broker that is about to die
shrinks the ISR for a few partitions down to just itself:

[2017-09-15 11:25:29,104] INFO Partition [...,12] on broker 3: Shrinking ISR from 3,2 to 3 (kafka.cluster.Partition)
[2017-09-15 11:25:29,107] INFO Partition [...,8] on broker 3: Shrinking ISR from 3,1 to 3 (kafka.cluster.Partition)
[2017-09-15 11:25:29,108] INFO Partition [...,38] on broker 3: Shrinking ISR from 3,2 to 3 (kafka.cluster.Partition)

Slightly after that, another broker writes errors like this to its log
file:

[2017-09-15 11:25:45,536] WARN [ReplicaFetcherThread-0-3]: Error in fetch to broker 3, request (type=FetchRequest, replicaId=2, maxWait=500, minBytes=1, maxBytes=10485760, fetchData={...}) (kafka.server.ReplicaFetcherThread)
java.io.IOException: Connection to 3 was disconnected before the response was read
        at org.apache.kafka.clients.NetworkClientUtils.sendAndReceive(NetworkClientUtils.java:93)
        at kafka.server.ReplicaFetcherBlockingSend.sendRequest(ReplicaFetcherBlockingSend.scala:93)
        at kafka.server.ReplicaFetcherThread.fetch(ReplicaFetcherThread.scala:207)
        at kafka.server.ReplicaFetcherThread.fetch(ReplicaFetcherThread.scala:42)
        at kafka.server.AbstractFetcherThread.processFetchRequest(AbstractFetcherThread.scala:151)
        at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:112)
        at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:64)

There are many such messages. At that point, I see the number of open
file descriptors on the other broker growing, and eventually it crashes
with thousands of messages like this:

[2017-09-15 11:31:23,273] ERROR Error while accepting connection (kafka.network.Acceptor)
java.io.IOException: Too many open files
        at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method)
        at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:422)
        at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:250)
        at kafka.network.Acceptor.accept(SocketServer.scala:337)
        at kafka.network.Acceptor.run(SocketServer.scala:280)
        at java.lang.Thread.run(Thread.java:745)

The file descriptor limit is set to 128k and the number of open file
descriptors during normal operation is about 8k, so there is normally
plenty of headroom.
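
For context, I'm tracking the descriptor count with a small script along
these lines (just a rough sketch, assuming psutil is available and the
broker PID is passed on the command line; my real setup feeds this into
our monitoring):

#!/usr/bin/env python
# Rough sketch: sample the broker's open file descriptor count every few
# seconds, so the growth is visible before the "Too many open files" errors.
# Assumes psutil is installed; the broker PID is passed as the only argument.
import sys
import time

import psutil


def main():
    broker = psutil.Process(int(sys.argv[1]))
    while True:
        # num_fds() reads /proc/<pid>/fd on Linux (UNIX only).
        print("%s fds=%d" % (time.strftime("%Y-%m-%d %H:%M:%S"),
                             broker.num_fds()))
        time.sleep(5)


if __name__ == "__main__":
    main()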

I'm not sure whether it's the other brokers trying to replicate that are
killing it, or clients trying to publish messages.
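
Next time it happens, I plan to break the established connections down by
remote address with something like the sketch below (again assuming
psutil; I'd still have to match the addresses against the other brokers'
IPs by hand) to see which side the descriptors belong to:

#!/usr/bin/env python
# Sketch: count the broker's established TCP connections per remote address,
# to see whether the descriptors belong to the other brokers (replica
# fetchers) or to producer/consumer clients.
# Assumes psutil is installed; the broker PID is passed as the only argument.
import sys
from collections import Counter

import psutil


def main():
    broker = psutil.Process(int(sys.argv[1]))
    per_peer = Counter()
    for conn in broker.connections(kind="tcp"):
        if conn.status != psutil.CONN_ESTABLISHED:
            continue
        # raddr is empty for listening sockets, so guard against that.
        remote_ip = conn.raddr[0] if conn.raddr else "unknown"
        per_peer[remote_ip] += 1
    for ip, count in per_peer.most_common():
        print("%6d  %s" % (count, ip))


if __name__ == "__main__":
    main()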

Has anyone seen behavior like this? I'd appreciate any pointers.

Thanks,

Lukas