Posted to users@kafka.apache.org by Toby Bristow <to...@jazznetworks.com.INVALID> on 2020/08/18 17:19:18 UTC

CPU usage spikes and replication failure

Hi,

Running Kafka 2.5.0 (Scala 2.12 build).

We run 3 identically configured kafka clusters (one in each region) and on
one of them we've recently started having regular issues with spikes in CPU
utilization lasting a few hours. In some cases we see large numbers of what
look like replication failures, with the logs filled with:

[2020-08-06 04:05:53,766] ERROR [ReplicaManager broker=0] Error processing append operation on partition __consumer_offsets-20 (kafka.server.ReplicaManager)
org.apache.kafka.common.errors.NotEnoughReplicasException: The size of the current ISR Set(0) is insufficient to satisfy the min.isr requirement of 2 for partition __consumer_offsets-20

Once in this state, the other brokers in the cluster also begin to
fail with similar messages and our producers begin to error.
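
As I understand it, that exception is what the broker returns when a
producer asks for acks=all and the partition's ISR has shrunk below
min.insync.replicas (2 here), so it looks like a symptom of the ISR
collapsing rather than anything the producers changed. A simplified
sketch of the producer side (not our exact code; the topic name and
bootstrap address are placeholders):

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class ProducerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Placeholder address, not our real brokers.
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker-0:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                StringSerializer.class.getName());
        // acks=all is what makes the broker reject appends with
        // NotEnoughReplicasException once the ISR drops below
        // min.insync.replicas (2 in our case).
        props.put(ProducerConfig.ACKS_CONFIG, "all");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            ProducerRecord<String, String> record =
                    new ProducerRecord<>("example-topic", "key", "value");
            producer.send(record, (metadata, exception) -> {
                if (exception != null) {
                    // NotEnoughReplicasException (a retriable error) shows
                    // up here once the brokers start shrinking the ISR.
                    System.err.println("Produce failed: " + exception);
                }
            });
            producer.flush();
        }
    }
}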

We've found that if we temporarily turn off all our producers, the
cluster quickly returns to normal, which suggests some additional
load is being placed on the system; however, none of our metrics
show an increase in message rates.
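
In case it helps, one cross-check we're planning is to read the
broker meters directly over JMX and compare MessagesInPerSec with
TotalProduceRequestsPerSec, since the same message volume arriving
as many more, smaller produce requests could push CPU up without
moving the message rate. A rough sketch (host/port are placeholders;
assumes JMX is enabled on the broker, e.g. via JMX_PORT):

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class BrokerRateCheck {
    public static void main(String[] args) throws Exception {
        // Placeholder host/port for a broker started with JMX enabled.
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://broker-0:9999/jmxrmi");
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection conn = connector.getMBeanServerConnection();
            String[] meters = {
                "kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec",
                "kafka.server:type=BrokerTopicMetrics,name=BytesInPerSec",
                "kafka.server:type=BrokerTopicMetrics,name=TotalProduceRequestsPerSec",
                "kafka.server:type=BrokerTopicMetrics,name=FailedProduceRequestsPerSec"
            };
            for (String meter : meters) {
                // Each meter exposes a OneMinuteRate attribute.
                Object rate = conn.getAttribute(new ObjectName(meter), "OneMinuteRate");
                System.out.println(meter + " = " + rate);
            }
        }
    }
}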

Each cluster has 3 brokers, each with 4 CPUs and 4 GiB of memory.
Under normal load we see 1,000-5,000 messages/s being produced and
roughly 10% CPU utilization (the intention is to load this cluster
more heavily in future).

During these incidents, however, we see CPU pinned at 100%
utilization for hours at a time.

Searching through previous issues, I noticed that
https://issues.apache.org/jira/browse/KAFKA-4477 and
https://issues.apache.org/jira/browse/KAFKA-6582 sound similar;
however, both were marked as fixed long before this release.

I've attached a thread dump from one of the brokers taken during a
high-load period. To my untrained eye it looks like everything is
just waiting on locks.

Any assistance in further diagnosing this issue would be really
appreciated. Our next step will probably be to completely rebuild this
cluster, as we've only ever seen these issues on one of our instances.

Regards,

Toby