Posted to users@kafka.apache.org by Eugene Koontz <ek...@siftscience.com> on 2017/05/18 23:32:04 UTC

Many "Shrinking ISR for partition" all at once on a single broker

Hello Kafka Users,

We recently experienced an issue where a single Kafka broker logged the
following for all of the topic-partitions for which it was the leader:

[2017-05-16 22:48:47,760] INFO Partition [<TOPIC>,<PARTITION>] on broker
561: Shrinking ISR for partition [<TOPIC>,<PARTITION>] from <B1>,<B2>,561
to 561 (kafka.cluster.Partition)

This message appeared for 128 separate topic-partition combinations in very
rapid succession: the log timestamps of the first and last of these messages
are only 200 ms apart.
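
(For what it's worth, something like the following sketch can pull the count
and the time span out of the broker log; the log path is just an example, and
the timestamp format is assumed to be the standard log4j layout shown in the
excerpt above.)

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.time.Duration;
    import java.time.LocalDateTime;
    import java.time.format.DateTimeFormatter;
    import java.util.List;
    import java.util.stream.Collectors;

    public class ShrinkIsrScan {
        // Matches the timestamp prefix of the log lines quoted above,
        // e.g. [2017-05-16 22:48:47,760]
        private static final DateTimeFormatter TS =
                DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss,SSS");

        public static void main(String[] args) throws IOException {
            // Path is an assumption; point it at the broker's server.log
            List<String> shrinks = Files.readAllLines(Paths.get("/var/log/kafka/server.log"))
                    .stream()
                    .filter(line -> line.contains("Shrinking ISR for partition"))
                    .collect(Collectors.toList());

            System.out.println("Shrink events: " + shrinks.size());

            if (shrinks.size() >= 2) {
                LocalDateTime first = parseTs(shrinks.get(0));
                LocalDateTime last  = parseTs(shrinks.get(shrinks.size() - 1));
                System.out.println("Span: " + Duration.between(first, last).toMillis() + " ms");
            }
        }

        // Extracts the "[yyyy-MM-dd HH:mm:ss,SSS]" prefix from a log line
        private static LocalDateTime parseTs(String line) {
            return LocalDateTime.parse(line.substring(1, 24), TS);
        }
    }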

At the same time, we noticed a spike in the metrics
kafka.network.RequestChannel.{RequestQueueSize,ResponseQueueSize}.

The RequestQueueSize went from its usual value of 0 or 1 up to 16, and the
ResponseQueueSize jumped from the typical 0 or 1 all the way to 127
(interestingly, 127 is one less than the 128 partitions for which the broker
decided it was the sole ISR). Both queues stayed at these elevated levels
until the broker was manually restarted about an hour later.
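
(For anyone who wants to keep an eye on these queues directly, something like
the following JMX polling sketch should work; the host, the JMX port (9999),
the poll interval, and the exact MBean object names, which I have guessed from
the metric names above, are all assumptions.)

    import javax.management.MBeanServerConnection;
    import javax.management.ObjectName;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    public class QueueSizeWatcher {
        public static void main(String[] args) throws Exception {
            // Host and JMX port are assumptions; the broker must be started
            // with JMX enabled (e.g. JMX_PORT=9999 in the Kafka start script).
            JMXServiceURL url = new JMXServiceURL(
                    "service:jmx:rmi:///jndi/rmi://production-kafka-3b:9999/jmxrmi");

            JMXConnector connector = JMXConnectorFactory.connect(url);
            MBeanServerConnection conn = connector.getMBeanServerConnection();

            // Object names guessed from the metric names kafka.network.RequestChannel.*
            ObjectName requestQueue = new ObjectName(
                    "kafka.network:type=RequestChannel,name=RequestQueueSize");
            ObjectName responseQueue = new ObjectName(
                    "kafka.network:type=RequestChannel,name=ResponseQueueSize");

            while (true) {
                Object req = conn.getAttribute(requestQueue, "Value");
                Object resp = conn.getAttribute(responseQueue, "Value");
                System.out.printf("RequestQueueSize=%s ResponseQueueSize=%s%n", req, resp);
                Thread.sleep(5000); // poll every 5 seconds (arbitrary)
            }
        }
    }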

There were no signs of networking problems in dmesg or /var/log/syslog,
although there was a dhclient message a couple of minutes before the incident:

May 16 22:47:48 production-kafka-3b dhclient: DHCPREQUEST of 10.0.130.239
on eth0 to 10.0.130.1 port 67 (xid=0x40dd7b45)

But these messages occur frequently (approximately every 30 minutes) and have
never caused Kafka problems in the past, so I doubt they are related.

In the minute following these messages, we saw this on the controller broker
(a different broker from the one mentioned above):

[2017-05-16 22:49:20,884] WARN [Controller-721-to-broker-561-send-thread],
Controller 721 epoch 34 fails to send request
{controller_id=721,controller_epoch=34,partition_states=[{topic=<TOPIC>,partition=34,controller_epoch=33,leader=561,leader_epoch=149,isr=[561],zk_version=329,replicas=[561,254,267]},...rack=us-east-1c}]}
to broker production-kafka-3b:9092 (id: 561 rack: null). Reconnecting to
broker. (kafka.controller.RequestSendThread)
java.io.IOException: Connection to 561 was disconnected before the response
was read

So the broker production-kafka-3b was having trouble keeping its followers in
sync, according to its own logs, and the controller was unable to communicate
with it, according to the controller's logs. However, the host itself did not
seem to have general networking problems: we could ssh to it without any
perceptible latency. The problem therefore seems to lie somewhere in Kafka's
handling of its network resources.

I found https://issues.apache.org/jira/browse/KAFKA-4477, which sounds very
similar to our situation, but it is reported as fixed in 0.10.1.1, which is
the version we have been running for several months, and indeed until
recently we never saw this issue. I am also investigating whether
https://issues.apache.org/jira/browse/KAFKA-5153 and
https://issues.apache.org/jira/browse/KAFKA-5007 are relevant to our
situation. I regret not capturing a jstack and lsof of the process before we
killed it, but will try to do so next time.
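
(For the next occurrence, something like the following sketch could capture
both before the restart; the PID argument and the /tmp output paths are
placeholders, and it simply shells out to jstack and lsof.)

    import java.io.File;
    import java.io.IOException;

    public class CaptureBrokerDiagnostics {
        public static void main(String[] args) throws IOException, InterruptedException {
            // The broker PID is passed on the command line, e.g.:
            //   java CaptureBrokerDiagnostics 12345
            String pid = args[0];

            // Output paths are arbitrary; adjust to taste.
            run(new String[] {"jstack", pid}, new File("/tmp/kafka-jstack-" + pid + ".txt"));
            run(new String[] {"lsof", "-p", pid}, new File("/tmp/kafka-lsof-" + pid + ".txt"));
        }

        // Runs a command and redirects its output (stdout and stderr) to the given file.
        private static void run(String[] cmd, File out) throws IOException, InterruptedException {
            Process p = new ProcessBuilder(cmd)
                    .redirectErrorStream(true)
                    .redirectOutput(out)
                    .start();
            p.waitFor();
            System.out.println(String.join(" ", cmd) + " -> " + out);
        }
    }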

Has anyone experienced this issue? Can anything be done operationally to
mitigate this problem other than restarting the affected broker?

Thanks,
Eugene